Abstract

Traditionally, stochastic approximation (SA) schemes have been popular choices for solving stochastic optimization
problems. However, the performance of standard SA implementations can vary significantly based on the
choice of the steplength sequence, and in general, little guidance is provided about good choices. Motivated by this 
gap, in the first part of the paper, we present two adaptive steplength schemes for strongly convex differentiable 
stochastic optimization problems, equipped with convergence theory, that aim to overcome some of the reliance on 
user-specific parameters. Of these, the first scheme, referred to as a recursive steplength stochastic approximation 
(RSA) scheme, optimizes the error bounds to derive a rule that expresses the steplength at a given iteration as 
a simple function of the steplength at the previous iteration and certain problem parameters. The second scheme, 
termed as a cascading steplength stochastic approximation (CSA) scheme, maintains the steplength sequence as a 
piecewise-constant decreasing function with the reduction in the steplength occurring when a suitable error threshold 
is met. 

In the second part of the paper, we allow for nondifferentiable objectives but with bounded subgradients over a 
certain domain. In such a regime, we propose a local smoothing technique, based on random local perturbations of 
the objective function, that leads to a differentiable approximation of the function. Assuming a uniform distribution 
on the local randomness, we establish a Lipschitzian property for the gradient of the approximation and prove that the 
obtained Lipschitz bound grows at a modest rate with problem size. This facilitates the development of an adaptive 
steplength stochastic approximation framework, which now requires sampling in the product space of the original 
measure and the artificially introduced distribution. The resulting adaptive steplength schemes are applied to three 
stochastic optimization problems. In particular, we observe that both schemes perform well in practice and display 
markedly less reliance on user-defined parameters. 

I. INTRODUCTION 

The use of stochastic gradient and subgradient schemes for the solution of stochastic convex optimization 
problems has a long tradition, beginning with an iterative scheme, first proposed by Robbins and Monro [1], 
that relied primarily on noisy gradient observations. Research by Ermoliev and his coauthors [2] -[5] focused 
largely on quasigradient (subgradient) methods and considered a host of stochastic programming problems, amongst 
them being two-period recourse-based problems (see [6]). To accelerate the convergence of stochastic subgradient 
methods, ergodic sequences, arising from the averaging of iterates, have been employed in [7]-[10]. Often gradient 
computations are either costly or unavailable; in such instances, a finite-difference approximation of the gradient 
can be constructed as first observed by Kiefer and Wolfowitz [11]. While standard finite-difference techniques 
perturb one direction at a time to obtain gradient estimates, simultaneous perturbation stochastic approximation 
techniques simultaneously perturb all directions and generally require fewer function evaluations [12], [13]. More
recently, there has been a significant interest in the application of ODE-based methods for investigating the stability 
and convergence of the associated stochastic approximation schemes [14], [15]. An elegant exposition of these 
methods may be found in the monographs by Polyak [16], Kushner and Yin [17], and Borkar [15]. 
the award NSF CMMI 0948905 ARRA.



Sample-average approximation (SAA) techniques [18] are often viewed as an alternative to stochastic approximation
techniques and are particularly attractive when approximate solutions to the problem are desired in an offline
manner. This approach relies on using a sample from the underlying distribution to construct a deterministic sample- 
average problem, which can be subsequently solved via standard nonlinear programming solvers, as seen in [19]. 
In [10], the authors demonstrate that stochastic approximation schemes are competitive with SAA
techniques. Importantly, in [10], Nemirovski et al. develop a robust SA scheme that determines an optimal constant
steplength for minimizing the theoretical error over a prespecified number of steps. Mirror-descent generalizations
of SA, that rely on a suitably defined prox-mapping, are also presented in [10] (also see [20]), while validation 
analysis is provided in [21]. 

Stochastic gradient algorithms have also been found to be effective in solving large deterministic problems such 
as convex feasibility problems [9], [22], [23], feasibility problems arising in control [24], [25] and some specially 
structured large-scale convex problems in [26]-[28]. Distributed consensus-based stochastic subgradient methods for 
minimizing a convex objective over a network have been recently developed and studied in [29]-[31]. The success 
of gradient-based methods in solving monotone variational inequalities [32] has prompted the study of similar 
techniques for contending with stochastic variational inequalities. In fact, Jiang and Xu [33] develop precisely 
such a scheme for the solution of strongly monotone stochastic variational inequalities and regularized variants 
were presented in [34] to allow for application to monotone stochastic variational inequalities. Finally, stochastic 
generalizations of the mirror-prox schemes were examined in [35] and allowed for the solution of monotone 
variational inequalities. 

While stochastic approximation schemes have proved successful, other avenues exist for addressing stochastic 
programs. For instance, an alternative approach lies in using sample-average approximation methods, which obtain
estimators of the optimal value and solution of the problem through the solution of a deterministic problem in which
the expectation is replaced by a sample-average. Convergence theory for the obtained estimators is examined by 
Shapiro [18]. Decomposition schemes, that leverage cutting-plane methods, have also been particularly successful 
in addressing two-period stochastic linear [36], convex [37] and nonconvex programs [38] while a scalable matrix- 
splitting decomposition scheme is presented in [39] for two-period stochastic Nash games. 

In this paper, we consider adaptive stochastic gradient and subgradient methods for solving constrained stochastic 
convex optimization problems. The novelty of our work can be categorized as follows: (1) the development and 
analysis of two adaptive stepsize rules; and (2) the development of a local function smoothing technique. Next, we 
provide some motivation and a more elaborate description of each. 

For stochastic gradient methods (cf. [2]-[5], [14]-[17], [40]), almost-sure convergence is
guaranteed assuming that the stepsize diminishes, but not too rapidly, i.e., the stepsize is proportional to $1/k^a$ with
$1/2 < a \le 1$. Typically, there is no guidance on the specific choice of the sequence and problem parameters play little
role in refining this choice. In contrast, in this paper, we propose specific (adaptive) rules for the stepsize values 
that exploit the information about the objective function. Accordingly, our first goal lies in examining whether one 
can construct a convergent scheme under an adaptive stepsize rule that is more reflective of the problem setting. 
Throughout this part of the paper, we assume that the integrand of the expectation is a random convex differentiable
function. Under a Lipschitzian assumption on the gradient, we propose two different adaptive stepsize rules: 

(a) Recursive stepsize rule: In attempting to minimize the bound on the expected error, we develop a recursive 
scheme for specifying the stepsize that requires only the steplength at the previous iteration and some
problem parameters. Global convergence and rate estimates for this scheme are developed. 

(b) Cascading stepsize rule: It is well-known that under suitable assumptions, fixed-stepsize schemes are guaranteed
to converge to a compact region containing the solution set of the original problem. We consider
a modified version of such a scheme where the trajectory moves to successively smaller compact regions
containing the solution set. Furthermore, as soon as the trajectory of iterates reaches within a bound of the
solution set, the steplength is updated, allowing the sequence to make further progress. Effectively, we consider
a method in which the steplength sequence can be viewed as one where the stepsize is maintained constant
with drops or cascades in stepsize occurring at particular epochs. While the scheme has intuitive appeal, we
provide theoretical support for the convergence of such an algorithmic framework.

When the random integrands arising in such stochastic problems are nonsmooth, direct application of known SA 
schemes is impossible. Contending with nonsmoothness in mathematical programming is often managed through 
avenues that rely on the solution of a sequence of smoothed problems (cf. [41], [42]). In a stochastic regime, 
an approach for addressing such problems is through a technique of global smoothing, as considered in [43] and 
more recently in [44].^1 This involves modifying the original problem by adding a random variable with possibly
unbounded support. However, such a technique is not feasible when the objective is defined over a restricted
domain. We present a local smoothing technique which leads to a globally differentiable approximation of the 
original function with Lipschitz continuous gradients. Furthermore, through such a smoothing, we derive a Lipschitz 
constant for the gradients and show that the constant grows at the rate of $\sqrt{n}$, where $n$ is the dimensionality of
the problem space. Importantly, this Lipschitzian property facilitates the construction of a stochastic approximation 
framework. Consequently, the second part of the paper focuses on computing solutions to approximations with 
smoothed integrands whose gradients are shown to be provably Lipschitz continuous. 

The remainder of the paper is organized as follows. In Section II, we establish the almost-sure convergence of the 
classical stochastic approximation algorithm for a constrained problem with a differentiable convex function with 
Lipschitz gradients. In Sections III and IV, for a strongly convex function, we propose and analyze two different 
stepsize rules, each motivated by a minimization of an estimate on the expected error per iteration of the method. 
In Section V, we introduce a local randomized smoothing technique for nondifferentiable convex optimization, and 
derive its approximation properties as well as a bound on the Lipschitz constant of the gradients. In Section VI, 
we report some numerical results obtained by applying our proposed stepsize rules and the smoothing technique to 
three test problems and conclude with a discussion in Section VII. 

Notation and basic terminology: We view vectors as columns, and write $x^T$ to denote the transpose of a vector
$x$. We use $\|x\|$ to denote the Euclidean vector norm, i.e., $\|x\| = \sqrt{x^T x}$. We write $\Pi_X(x)$ to denote the Euclidean
projection of a vector $x$ on a set $X$, i.e., $\|x - \Pi_X(x)\| = \min_{y \in X} \|x - y\|$. For a convex function $f$ with domain
$\mathrm{dom} f$, a vector $g$ is a subgradient of $f$ at $\bar{x} \in \mathrm{dom} f$ if the following relation holds^2:

$$f(\bar{x}) + g^T(x - \bar{x}) \le f(x) \quad \text{for all } x \in \mathrm{dom} f.$$

The subdifferential set of $f$ at $x = \bar{x}$, denoted by $\partial f(\bar{x})$, is the set of all subgradients of $f$ at $x = \bar{x}$. Finally, we
write a.s. for "almost surely", and use Prob(Z) and E[Z] to denote the probability of an event Z and the expectation 
of a random variable Z, respectively. 

II. Problem Formulation and Background 

In this section, we begin by describing the problem and iterative scheme of interest (Section II-A). This is
followed by Section II-B where we provide a short description of various adaptive schemes in the realm of
stochastic approximation.

^1 See [45] for a scheme that develops an approximation method for addressing a class of separable piecewise-linear stochastic optimization
problems with integer breakpoints.

^2 For a differentiable convex $f$, the inequality holds with $g = \nabla f(\bar{x})$.
A. Problem Formulation

We consider the following stochastic optimization problem

$$\min_{x \in X} f(x) = E[F(x,\xi)], \qquad (1)$$

where $F : \mathcal{D} \times \Omega \to \mathbb{R}$ is a function, the set $\mathcal{D} \subseteq \mathbb{R}^n$ is open, and the set $X$ is nonempty with $X \subset \mathcal{D}$. The vector
$\xi : \Omega \to \mathbb{R}^d$ is a random vector with a probability distribution on a set $\Omega \subseteq \mathbb{R}^d$, while the expectation $E[F(x,\xi)]$
is taken with respect to $\xi$. We use $X^*$ to denote the optimal set of problem (1) and $f^*$ to denote its optimal value.
We assume the following:

Assumption 1: The set $X \subset \mathcal{D}$ is convex and closed. The function $F(\cdot,\xi)$ is convex on $\mathcal{D}$ for every $\xi \in \Omega$, and
the expected value $E[F(x,\xi)]$ is finite for every $x \in \mathcal{D}$.

Under Assumption 1, the function $f$ is convex over $X$ and the following relation holds

$$\partial f(x) = E[\partial_x F(x,\xi)] \quad \text{for all } x \in \mathcal{D}, \qquad (2)$$

where $\partial_x F(x,\xi)$ denotes the set of all subgradients of $F(x,\xi)$ with respect to the variable $x$ (see [46], [47]^3).

First, we will consider problem (1) where $f$ is a differentiable function with Lipschitz gradients. Later, we
will allow the function $f$ to be nondifferentiable and we will consider a local smoothing technique yielding a
differentiable function that approximates $f$ over $X$. For this reason, we start our discussion by focusing on a
differentiable problem (1) and the following iterative algorithm:

$$\begin{aligned}
x_{k+1} &= \Pi_X\left(x_k - \gamma_k(\nabla f(x_k) + w_k)\right) \quad \text{for all } k \ge 0, \\
w_k &= \nabla_x F(x_k,\xi_k) - \nabla f(x_k).
\end{aligned} \qquad (3)$$

Here, $x_0 \in X$ is a random initial point, $\gamma_k > 0$ is a (deterministic) stepsize, and $w_k$ is the random vector given
by the difference between the sampled gradient $\nabla_x F(x,\xi_k)$ and its expectation $E[\nabla_x F(x,\xi)]$ evaluated at $x = x_k$.
Throughout the paper, we assume that $E[\|x_0\|^2] < \infty$.
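
As an illustration of iteration (3), the following is a minimal sketch under assumptions that are not part of the paper: the integrand is taken to be $F(x,\xi) = \frac{1}{2}\|x - \xi\|^2$, the set $X$ is a box, the noise is uniform, and the stepsize is the harmonic choice $\gamma_k = \theta/(k+1)$. The function names `project_box`, `sampled_gradient`, and `projected_sa` are hypothetical.

```python
import random

def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (plays the role of Pi_X)."""
    return [min(max(xi, lo), hi) for xi in x]

def sampled_gradient(x, xi):
    """Gradient in x of the assumed integrand F(x, xi) = 0.5 * ||x - xi||^2."""
    return [a - b for a, b in zip(x, xi)]

def projected_sa(x0, mean, num_iters, theta=1.0, lo=-1.0, hi=1.0, seed=0):
    """Run x_{k+1} = Pi_X(x_k - gamma_k * (grad f(x_k) + w_k)) with gamma_k = theta/(k+1)."""
    rng = random.Random(seed)
    x = project_box(x0, lo, hi)
    for k in range(num_iters):
        gamma = theta / (k + 1)
        # xi has mean `mean`, so the sampled gradient is unbiased for grad f(x_k).
        xi = [m + rng.uniform(-1.0, 1.0) for m in mean]
        g = sampled_gradient(x, xi)  # equals grad f(x_k) + w_k
        x = project_box([a - gamma * b for a, b in zip(x, g)], lo, hi)
    return x

x_final = projected_sa(x0=[0.0, 0.0], mean=[0.5, 2.0], num_iters=20000)
print(x_final)
```

For this assumed problem, $f(x) = \frac{1}{2}\|x - m\|^2$ up to a constant, where $m$ is the mean of $\xi$, so the minimizer over the box is the projection of $m$, here $(0.5, 1)$. The second coordinate shows the role of the projection: the unconstrained minimizer lies outside $X$.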

We let $\mathcal{F}_k$ denote the history of the method up to time $k$, i.e., $\mathcal{F}_k = \{x_0, \xi_0, \xi_1, \ldots, \xi_{k-1}\}$ for $k \ge 1$ and
$\mathcal{F}_0 = \{x_0\}$. By Assumption 1 and relation (2), it follows that $\nabla f(x_k) = E[\nabla_x F(x_k,\xi)]$ for a differentiable $F$,
implying that $w_k$ has zero mean, i.e.,

$$E[w_k \mid \mathcal{F}_k] = 0 \quad \text{for all } k \ge 0. \qquad (4)$$

Next, we state some additional assumptions on the stochastic gradient error $w_k$ and the stepsize $\gamma_k$.

Assumption 2: The stepsize is such that $\gamma_k > 0$ for all $k$. Furthermore, the following hold:

(a) $\sum_{k=0}^\infty \gamma_k = \infty$ and $\sum_{k=0}^\infty \gamma_k^2 < \infty$.

(b) The stochastic errors $w_k$ satisfy $\sum_{k=0}^\infty \gamma_k^2 E[\|w_k\|^2 \mid \mathcal{F}_k] < \infty$ almost surely.

Assumption 2(b) is satisfied, for example, when $\sum_{k=0}^\infty \gamma_k^2 < \infty$ and the error $w_k$ is bounded almost surely, i.e.,
$\|w_k\| \le c$ for all $k$ and some scalar $c$ almost surely.

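We use the following lemma in establishing the convergence of method (3) (see [16], page 50).

Lemma 1: (Robbins-Siegmund) Let $v_k$, $u_k$, $\alpha_k$, and $\beta_k$ be nonnegative random variables, and let the following
relations hold almost surely:

$$E[v_{k+1} \mid \tilde{\mathcal{F}}_k] \le (1 + \alpha_k) v_k - u_k + \beta_k \quad \text{for all } k, \qquad \sum_{k=0}^\infty \alpha_k < \infty, \qquad \sum_{k=0}^\infty \beta_k < \infty,$$

where $\tilde{\mathcal{F}}_k$ denotes the collection $v_0, \ldots, v_k$, $u_0, \ldots, u_k$, $\alpha_0, \ldots, \alpha_k$, $\beta_0, \ldots, \beta_k$. Then, almost surely we have

$$\lim_{k \to \infty} v_k = v, \qquad \sum_{k=0}^\infty u_k < \infty,$$

where $v \ge 0$ is some random variable.

^3 In both of these articles, the analysis is for a function defined over $\mathbb{R}^n \times \Omega$, but can be extended to the case of a function defined over
$\mathcal{D} \times \Omega$ for an open convex set $\mathcal{D} \subseteq \mathbb{R}^n$.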

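We also make use of the following result, which can be found in [16] (see Lemma 11 on page 50).

Lemma 2: Let $\{v_k\}$ be a sequence of nonnegative random variables, where $E[v_0] < \infty$, and let $\{\alpha_k\}$ and $\{\beta_k\}$
be deterministic scalar sequences such that:

$$E[v_{k+1} \mid v_0, \ldots, v_k] \le (1 - \alpha_k) v_k + \beta_k \quad \text{a.s. for all } k \ge 0,$$

$$0 \le \alpha_k \le 1, \qquad \beta_k \ge 0, \qquad \sum_{k=0}^\infty \alpha_k = \infty, \qquad \sum_{k=0}^\infty \beta_k < \infty, \qquad \lim_{k \to \infty} \frac{\beta_k}{\alpha_k} = 0.$$

Then, $v_k \to 0$ almost surely, $\lim_{k \to \infty} E[v_k] = 0$, and for any $\epsilon > 0$,

$$\mathrm{Prob}(v_j \le \epsilon \text{ for all } j \ge k) \ge 1 - \frac{1}{\epsilon}\left(E[v_k] + \sum_{i=k}^\infty \beta_i\right) \quad \text{for all } k > 0.$$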



In Sections III and IV, we examine adaptive steplength schemes for a strongly convex function $f$ whose gradients
$\nabla f$ are Lipschitz continuous over $X$ with constant $L$. X is defined as

B. Adaptive Stochastic Approximation Schemes 

Robbins and Monro [1] proposed the first stochastic approximation algorithm in 1951 while Kiefer and Wolfowitz [11]
proposed a variant of this scheme in which finite differences were employed to estimate the gradient.
Asymptotic distributions of the Robbins-Monro scheme were first examined by Chung [48], leading to an asymptotic
normality result in the one-dimensional regime while generalizations were subsequently studied by Sacks [49]. 

A potential challenge in developing efficient implementations of stochastic approximation lies
in choosing an appropriate steplength sequence. Kesten [50], in 1957, suggested a technique where the steplength
sequence adapts to the observed data, which was further extended by Kushner and Gavin [51] to the multidimensional
regime, while its accelerations were studied in [52]. Sacks [49] proved that, under suitable conditions,
a choice of the form $a/k$ (where $k$ is the iteration index) is optimal from the standpoint of minimizing the asymptotic
variance. Yet, the challenge lies in estimating the "optimal" $a$. Subsequently, Ventner [53], in what is possibly
amongst the first adaptive steplength SA schemes, considered sequences of the form $a_k/k$ where $a_k$ is updated
by leveraging past information. Notably, Chung [48] also examined the asymptotic variance properties of SA
when steplength choices of the form $a/k^{1-\alpha}$ with $\alpha < \frac{1}{2}$ are used. In related work on adaptive schemes, Lai
and Robbins [54] considered schemes of the form $a_k/k$ where $a_k$ is a strongly consistent estimator of $\nabla f(x)$
in a stochastic root-finding problem. One choice for obtaining $a_k$ is through the use of least-squares estimators.
Multivariate generalizations of this analysis were suggested by Wei [55] in 1987 and again, it was observed that 
the Jacobian of the vector function assumes relevance in constructing efficient steplength sequences. 

An alternative to using a single sample was suggested by Spall [12] and relied on obtaining gradient estimates 
through a simultaneous perturbation of all the parameters. An adaptive generalization of this scheme, proposed by 
the same author [56], [57], employed an additional recursion to the standard projected gradient step that attempted to 
estimate the Jacobian in root finding problems or the Hessian in optimization problems. Accordingly, the modified 
update rule is of the form 

$$x_{k+1} = \Pi_X\left(x_k - \gamma_k H_k^{-1}(\nabla f(x_k) + w_k)\right), \qquad (5)$$

where $H_k$ is an estimate of the Hessian matrix of the objective. Clearly, this also falls under the regime of an
adaptive steplength scheme. Related adaptive schemes may also be found in the work by Bhatnagar [58], [59]. 

A final remark is in order regarding the key difference between our proposed schemes and past work. A majority 
of the adaptive schemes in the literature employ past information to update the steplength. One such avenue involves 
developing estimates of the Hessian, which are subsequently used in scaling the gradient step appropriately. In the
sections that follow, we consider two very different approaches that are linked by a crucial property: they rely on
using algorithm and problem parameters, and not sample points, to develop adaptive steplength schemes. 

C. Smoothing Techniques 

One of the goals of this paper is to address stochastic optimization problems with nonsmooth integrands. Here, we 
provide some background for accommodating nonsmoothness in optimization problems. In deterministic regimes, 
subgradient methods and their incremental variants have proved popular (see [26], [27], [60]), as have bundle 
methods [61], amongst others. One approach for managing nonsmoothness is through smoothing approaches. For 
instance, such avenues have allowed for the solution of variational inequalities and complementarity problems [32] 
as well as mathematical programs with equilibrium constraints [62]. 

In this paper, we also adopt a smoothing technique which bears little similarity to such approaches. We adopt a 
framework that can be traced back to a class of averaged functions introduced by Steklov [63], [64] in 1907. A 
general definition of such an averaging over possibly discontinuous functions is provided next [65]. 

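Definition 1: Given a locally integrable function $f : \mathbb{R}^n \to \mathbb{R}$ and a family of mollifiers $\{p_\epsilon : \mathbb{R}^n \to \mathbb{R}_+, \epsilon > 0\}$
that satisfy

$$\int_{\mathbb{R}^n} p_\epsilon(z)\,dz = 1, \qquad \mathrm{supp}(p_\epsilon) := \{z \in \mathbb{R}^n \mid p_\epsilon(z) > 0\} \subseteq \rho_\epsilon B \quad \text{with } \rho_\epsilon \downarrow 0 \text{ as } \epsilon \downarrow 0,$$

where $B$ is a unit ball in $\mathbb{R}^n$, the associated family $\{f_\epsilon, \epsilon > 0\}$ of averaged functions is defined by

$$f_\epsilon(x) := \int_{\mathbb{R}^n} f(x - z) p_\epsilon(z)\,dz = \int_{\mathbb{R}^n} f(z) p_\epsilon(x - z)\,dz.$$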

In effect, the mollifier is a probability density function and the family of smoothed approximations, denoted by
$\{f_\epsilon, \epsilon > 0\}$, must possess a host of convergence properties with respect to $f$ as $\epsilon \to 0$. For instance, if $f$ is a
continuous function, then $f_\epsilon$ converges uniformly to $f$ on every bounded subset of $\mathbb{R}^n$. In the absence of continuity,
this cannot be guaranteed; yet, epi-convergence results [66] for this class of functions may be
employed in an effort to establish convergence of the infima/minima. These averaging functions have allowed for
solving convex nondifferentiable optimization problems [67], [68] and discontinuous optimization problems [69], 
by minimizing a sequence of averaged or smoothed functions. 

We pursue an alternative to solving a sequence of smoothed problems and obtain an approximate solution by 
solving a single smoothed problem with a fixed $\epsilon$, akin to that employed by Lakshmanan and Farias [44]. However,
since we intend to leverage stochastic approximation schemes of the form described earlier in this paper, Lipschitz
constants associated with the gradients are a requirement. In [44], the authors obtain Lipschitz constants assuming that
the averaging is achieved through a normal distribution that requires the function be defined everywhere. Instead 
of "globally smoothing" the function, we employ a uniform distribution, referred to as "local smoothing." 
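
To make the idea of averaging with a uniform distribution concrete, the sketch below treats a one-dimensional example that is an assumption of this illustration and not taken from the paper: $f(x) = |x|$ perturbed by $z$ uniform on $[-\epsilon, \epsilon]$. In this case the averaged function has the closed form $(x^2 + \epsilon^2)/(2\epsilon)$ for $|x| \le \epsilon$ and $|x|$ otherwise, so a Monte Carlo estimate can be compared against it. The helper names are hypothetical.

```python
import random

def smoothed_abs_exact(x, eps):
    """Closed form of E[|x + z|] for z uniform on [-eps, eps]."""
    if abs(x) >= eps:
        return abs(x)
    return (x * x + eps * eps) / (2.0 * eps)

def smoothed_mc(f, x, eps, num_samples=100000, seed=0):
    """Monte Carlo estimate of the locally smoothed value E[f(x + z)], z ~ U[-eps, eps]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += f(x + rng.uniform(-eps, eps))
    return total / num_samples

eps = 0.5
estimate = smoothed_mc(abs, 0.1, eps)
exact = smoothed_abs_exact(0.1, eps)
print(estimate, exact)
```

In this example the smoothed function is differentiable everywhere, with derivative $x/\epsilon$ on $[-\epsilon, \epsilon]$ and $\pm 1$ outside, so its gradient is Lipschitz with constant $1/\epsilon$ even though $|x|$ itself is not differentiable at the origin.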

III. A RECURSIVE STEPLENGTH STOCHASTIC APPROXIMATION SCHEME 

In this section, we introduce an adaptive stochastic approximation scheme that overcomes certain challenges 
associated with implementing standard diminishing steplength schemes and relies on the use of a recursive rule 
for prescribing steplengths. We begin by examining the standard stochastic gradient method for problem (1) in
Section III-A. In general, the convergence of this scheme is guaranteed under the requirement that $\sum_{k=0}^\infty \gamma_k = \infty$
and $\sum_{k=0}^\infty \gamma_k^2 < \infty$. A host of choices exists, with one possible choice being $\gamma_k = \theta/k$. Yet, the choice of
the appropriate $\theta$ can have a significant impact on the performance of the algorithm. Motivated by the desire to
minimize the "expected error," we develop a recursive stochastic approximation algorithm (referred to as the RSA 
scheme) in which the steplength at a particular iteration is a function of the steplength at the previous iteration and 
some problem parameters. In Section III-B, we motivate and introduce such a scheme and proceed to develop the 
associated convergence theory in Section III-C. 

A. Preliminaries 

We consider method (3) as applied to problem (1) where $f$ has Lipschitz gradients. The method generates a
sequence of iterates that converges to an optimal solution almost surely, as shown in the forthcoming proposition.
This result is a straightforward extension of Theorem 1 in [16, Pg. 51] which pertains to an unconstrained problem.

Proposition 1 (Almost-sure convergence): Let Assumptions 1-2 hold, and let $f$ be differentiable over the set $X$
with Lipschitz gradients. Assume that the optimal set $X^*$ of problem (1) is nonempty. Then, the sequence $\{x_k\}$
generated by (3) converges almost surely to some random point in $X^*$.

Proof: By definition of the method and the nonexpansive property of the projection operation, we obtain for
any $x^* \in X^*$ and $k \ge 0$,

$$\begin{aligned}
\|x_{k+1} - x^*\|^2 &\le \|x_k - x^* - \gamma_k(\nabla f(x_k) + w_k)\|^2 \\
&= \|x_k - x^*\|^2 - 2\gamma_k(\nabla f(x_k) + w_k)^T(x_k - x^*) + \gamma_k^2\|\nabla f(x_k) + w_k\|^2.
\end{aligned}$$

By the convexity of $f$ and the gradient inequality, we have

$$\|x_{k+1} - x^*\|^2 \le \|x_k - x^*\|^2 - 2\gamma_k(f(x_k) - f(x^*)) - 2\gamma_k w_k^T(x_k - x^*) + \gamma_k^2\|\nabla f(x_k) + w_k\|^2.$$

Since $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ for any $a, b \in \mathbb{R}^n$, by using $f^* = f(x^*)$, and by adding and subtracting $\nabla f(x^*)$
in the last term, we obtain

$$\begin{aligned}
\|x_{k+1} - x^*\|^2 \le{} & \|x_k - x^*\|^2 - 2\gamma_k(f(x_k) - f^*) - 2\gamma_k w_k^T(x_k - x^*) + 2\gamma_k^2\|\nabla f(x_k) - \nabla f(x^*)\|^2 \\
& + 2\gamma_k^2\|\nabla f(x^*) + w_k\|^2.
\end{aligned}$$

Taking the conditional expectation given $\mathcal{F}_k$, using $E[w_k \mid \mathcal{F}_k] = 0$ (see Eq. (4)) and the Lipschitzian property of
the gradient, we have

$$\begin{aligned}
E[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k] \le{} & (1 + 2L^2\gamma_k^2)\|x_k - x^*\|^2 - 2\gamma_k(f(x_k) - f^*) \\
& + 2\gamma_k^2\left(\|\nabla f(x^*)\|^2 + E[\|w_k\|^2 \mid \mathcal{F}_k]\right).
\end{aligned}$$

Under Assumption 2, the conditions of Lemma 1 are satisfied. Therefore, almost surely, the sequence $\{\|x_{k+1} - x^*\|\}$
is convergent for any $x^* \in X^*$ and $\sum_{k=0}^\infty \gamma_k(f(x_k) - f^*) < \infty$. The former relation implies that $\{x_k\}$ is
bounded a.s., while the latter implies $\liminf_{k \to \infty} f(x_k) = f^*$ a.s. in view of the condition $\sum_{k=0}^\infty \gamma_k = \infty$. Since the
set $X$ is closed, all accumulation points of $\{x_k\}$ lie in $X$. Furthermore, since $f(x_k) \to f^*$ along a subsequence a.s.,
by continuity of $f$ it follows that $\{x_k\}$ has a subsequence converging to some random point in $X^*$ a.s. Moreover,
since $\{\|x_{k+1} - x^*\|\}$ is convergent for any $x^* \in X^*$ a.s., the entire sequence $\{x_k\}$ converges to some random point
in $X^*$ a.s. ■




Under the Lipschitz continuity of the gradient and the strong convexity of the objective, an expected error bound 
may also be provided for the method. During the development of the error bound, the following intermediate result 
assumes relevance. 

Lemma 3: Let Assumption 1 hold, and let $f$ be differentiable over the set $X$ with Lipschitz gradients with
constant $L > 0$. Also, assume that the optimal set $X^*$ of problem (1) is nonempty. Let the sequence $\{x_k\}$ be
generated by algorithm (3) with any (deterministic) stepsize $\gamma_k > 0$. Then, for any $x^* \in X^*$ and any $k \ge 0$, the
following holds almost surely:

$$E[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k] \le \|x_k - x^*\|^2 + \gamma_k^2 E[\|w_k\|^2 \mid \mathcal{F}_k] - \gamma_k(2 - \gamma_k L)(x_k - x^*)^T(\nabla f(x_k) - \nabla f(x^*)).$$

Proof: By the first-order optimality conditions, a vector $x^*$ is optimal for the problem if and only if it satisfies

$$x^* = \Pi_X(x^* - \gamma \nabla f(x^*)) \quad \text{for any } \gamma > 0.$$

By the definition of the method and the nonexpansive property of the projection operation, we obtain for all $k \ge 0$,

$$\begin{aligned}
\|x_{k+1} - x^*\|^2 &= \|\Pi_X(x_k - \gamma_k(\nabla f(x_k) + w_k)) - \Pi_X(x^* - \gamma_k \nabla f(x^*))\|^2 \\
&\le \|x_k - x^* - \gamma_k(\nabla f(x_k) + w_k - \nabla f(x^*))\|^2.
\end{aligned}$$

Taking the expectation conditioned on the past, and using $E[w_k \mid \mathcal{F}_k] = 0$ (cf. Eq. (4)), we have

$$\begin{aligned}
E[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k] \le{} & \|x_k - x^*\|^2 + \gamma_k^2\|\nabla f(x_k) - \nabla f(x^*)\|^2 + \gamma_k^2 E[\|w_k\|^2 \mid \mathcal{F}_k] \\
& - 2\gamma_k(x_k - x^*)^T(\nabla f(x_k) - \nabla f(x^*)).
\end{aligned}$$

The Lipschitz gradient property for a convex function is equivalent to co-coercivity of the gradient map with constant
$1/L$ (see [16, Pg. 24, Lemma 2]), i.e., for all $x, y \in X$,

$$\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2 \le (x - y)^T(\nabla f(x) - \nabla f(y)).$$

Therefore, for any $x^* \in X^*$ and any $k \ge 0$,

$$E[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k] \le \|x_k - x^*\|^2 + \gamma_k^2 E[\|w_k\|^2 \mid \mathcal{F}_k] - \gamma_k(2 - \gamma_k L)(x_k - x^*)^T(\nabla f(x_k) - \nabla f(x^*)).$$

■
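
The co-coercivity inequality used in the proof can be checked numerically on a simple instance. The sketch below assumes, purely for illustration, a quadratic $f(x) = \frac{1}{2}\sum_i q_i x_i^2$ with positive weights $q_i$, for which $L = \max_i q_i$; the function names are hypothetical.

```python
import random

def grad_quadratic(q_diag, x):
    """Gradient of the assumed f(x) = 0.5 * sum_i q_i * x_i^2, i.e., Q x with Q diagonal."""
    return [q * xi for q, xi in zip(q_diag, x)]

def cocoercivity_gap(q_diag, x, y):
    """Return (x - y)^T (grad f(x) - grad f(y)) - (1/L) * ||grad f(x) - grad f(y)||^2."""
    lipschitz = max(q_diag)  # Lipschitz constant of the gradient for a diagonal quadratic
    gx, gy = grad_quadratic(q_diag, x), grad_quadratic(q_diag, y)
    diff_g = [a - b for a, b in zip(gx, gy)]
    diff_x = [a - b for a, b in zip(x, y)]
    inner = sum(d * g for d, g in zip(diff_x, diff_g))
    return inner - sum(g * g for g in diff_g) / lipschitz

rng = random.Random(0)
q_diag = [0.5, 1.0, 4.0]
gaps = []
for _ in range(1000):
    x = [rng.uniform(-5, 5) for _ in q_diag]
    y = [rng.uniform(-5, 5) for _ in q_diag]
    gaps.append(cocoercivity_gap(q_diag, x, y))
# Co-coercivity predicts a nonnegative gap (up to floating-point rounding).
print(min(gaps) >= -1e-9)
```

For this quadratic the gap equals $\sum_i q_i(1 - q_i/L)(x_i - y_i)^2$, which is nonnegative and vanishes along the coordinate with $q_i = L$, so the constant $1/L$ cannot be improved in this example.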

In what follows, we will often use a stronger version of Assumption 2(b), given as follows.

Assumption 3: The errors $w_k$ are such that for some $\nu > 0$,

$$E[\|w_k\|^2 \mid \mathcal{F}_k] \le \nu^2 \quad \text{a.s. for all } k \ge 0.$$

Next, we provide an error bound for algorithm (3) under the assumption that $f(x)$ is a strongly convex function
with Lipschitz gradients. Note that strong convexity of $f(x)$ over a set $K$ follows if $F(x, \xi(\omega))$ is a
strongly convex function for $\omega \in \bar{\Omega}$, where $\bar{\Omega}$ is a set of positive measure defined as

$$\bar{\Omega} = \left\{\omega : \exists\, \eta > 0,\ (y - x)^T(\nabla F(y, \xi(\omega)) - \nabla F(x, \xi(\omega))) \ge \eta\|x - y\|^2 \text{ for all } x, y \in K\right\}.$$

Less formally, we merely require that $F(x, \xi)$ be a strongly convex function with positive, but arbitrarily small,
probability to ensure that $f(x)$ is strongly convex over $K$ (see [70]).

Lemma 4 (Strongly convex function with Lipschitz gradients): Let Assumptions 1-2 hold. Also, let $f$ be differentiable
over the set $X$ with Lipschitz gradients with constant $L > 0$ and strongly convex with constant $\eta > 0$.
Then, the sequence $\{x_k\}$ generated by algorithm (3) converges almost surely to the unique optimal solution of
problem (1). Furthermore, if the stepsize satisfies $0 < \gamma_k < \frac{2}{L}$ for all $k \ge 0$, we then have:

(a) The following relation holds almost surely:

$$E[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k] \le (1 - \eta\gamma_k(2 - \gamma_k L))\|x_k - x^*\|^2 + \gamma_k^2 E[\|w_k\|^2 \mid \mathcal{F}_k] \quad \text{for all } k \ge 0.$$

(b) If Assumption 2(b) is replaced with Assumption 3, then the following relation holds almost surely:

$$E[\|x_{k+1} - x^*\|^2] \le (1 - \eta\gamma_k(2 - \gamma_k L))E[\|x_k - x^*\|^2] + \gamma_k^2\nu^2 \quad \text{for all } k \ge 0.$$

Moreover, $\lim_{k \to \infty} E[\|x_k - x^*\|^2] = 0$, and for every $\epsilon > 0$,

$$\mathrm{Prob}\left(\|x_j - x^*\|^2 \le \epsilon \text{ for all } j \ge k\right) \ge 1 - \frac{1}{\epsilon}\left(E[\|x_k - x^*\|^2] + \nu^2\sum_{i=k}^\infty \gamma_i^2\right) \quad \text{for all } k > 0.$$

Proof: The existence and uniqueness of the optimal solution of problem (1) is guaranteed by the strong
convexity of $f(x)$. The convergence of the method follows by Proposition 1. To establish the relation in part (a)
for the expected value $E[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k]$, we use Lemma 3, which implies for the optimal $x^*$ and all $k \ge 0$,

$$E[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k] \le \|x_k - x^*\|^2 + \gamma_k^2 E[\|w_k\|^2 \mid \mathcal{F}_k] - \gamma_k(2 - \gamma_k L)(x_k - x^*)^T(\nabla f(x_k) - \nabla f(x^*)).$$

By the strong convexity of $f(x)$, we have $(x - y)^T(\nabla f(x) - \nabla f(y)) \ge \eta\|x - y\|^2$ for all $x, y \in X$, which when
combined with the preceding relation implies for all $k \ge 0$,

$$E[\|x_{k+1} - x^*\|^2 \mid \mathcal{F}_k] \le (1 - \eta\gamma_k(2 - \gamma_k L))\|x_k - x^*\|^2 + \gamma_k^2 E[\|w_k\|^2 \mid \mathcal{F}_k], \qquad (6)$$

thus showing the relation in part (a).

The relation in part (b) follows from inequality (6) by using Assumption 3 and by taking the total expectation. To
show the other results in part (b), we apply Lemma 2. For this, we need to verify that all the conditions of Lemma 2
hold. Since $0 < \gamma_k < \frac{2}{L}$, it follows that $\eta\gamma_k(2 - \gamma_k L) > 0$. Also, in view of $\eta \le L$, we have $\eta\gamma_k(2 - \gamma_k L) \le 1$.
Obviously, $\nu^2\gamma_k^2 \ge 0$ for all $k$. Since Assumption 2(a) holds, we have $\sum_{k=0}^\infty \eta\gamma_k(2 - \gamma_k L) = \infty$ and $\sum_{k=0}^\infty \nu^2\gamma_k^2 < \infty$.
Furthermore, since $\gamma_k \to 0$, we have

$$\lim_{k \to \infty} \frac{\nu^2\gamma_k^2}{\eta\gamma_k(2 - \gamma_k L)} = \lim_{k \to \infty} \frac{\nu^2\gamma_k}{\eta(2 - \gamma_k L)} = 0.$$

Hence, the conditions of Lemma 2 hold and the stated results follow. ■
Lemma 4 will be employed in developing our adaptive stepsize schemes. Before proceeding, we make the 
following comment regarding the lemma. 

Remark 1: The result in part (a) of Lemma 4 is similar to a result in [10], which was derived by requiring only
the strong convexity of the function $f$. Here, we make the additional assumption that the gradients are Lipschitz
continuous and this assumption gains relevance when we employ local random smoothing in Section V. Furthermore,
our result depends on the expectation of gradient errors, $E[\|w_k\|^2 \mid \mathcal{F}_k]$, with $w_k$ defined in (3). Note that, in contrast,
the result in [10] depends on the expectation of the subgradient norms, $E[\|G(x, \xi)\|^2]$, where $G(x, \xi) \in \partial_x F(x, \xi)$.

B. A recursive steplength scheme 

A challenge associated with the implementation of diminishing steplength schemes lies in determining an 
appropriate sequence {7/c}. The key result of this section is the motivation and introduction of a scheme that 
adaptively optimizes the steplength from iteration to iteration. Our adaptive scheme relies on the minimization of 
a suitably defined error function at each step. We start with the relation in part (b) of Lemma 4: 

E[||xfc+i - x*f ] < (1 - 7?7fc(2 - 7fcL))E[||xfc - x^'f] + ^lu^ for all > 0. (7) 
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When the stepsize is further restricted so that < 7/c < we have 

1 - 7^7/^(2 - 7/eL) < 1 - r]-fk. 

Thus, for < 7/e < inequaUty (7) yields 

E[||xfe+i < (1 -7/7fc)E[||xfc +7^^2 for all > 0. (8) 

We now use relation (8) to develop an adaptive stepsize procedure. Loosely speaking for the moment, let us
view the quantity $\mathsf{E}[\|x_{k+1} - x^*\|^2]$ as an error $e_{k+1}$ of the method arising from the use of the stepsize values
$\gamma_0, \gamma_1, \ldots, \gamma_k$. Also, consider the worst case error, which is the case when (8) holds with equality. Thus, in the
worst case, the error satisfies the following recursive relation:

$$e_{k+1}(\gamma_0,\ldots,\gamma_k) = (1 - \eta\gamma_k)\,e_k(\gamma_0,\ldots,\gamma_{k-1}) + \gamma_k^2\nu^2.$$

Then, it seems natural to investigate if the stepsizes $\gamma_0, \gamma_1, \ldots, \gamma_k$ can be selected so as to minimize the error $e_{k+1}$.
It turns out that this can indeed be achieved, and minimizing the error $e_{k+2}$ at the next iteration can also be done
by selecting $\gamma_{k+1}$ as a function of only the most recent stepsize $\gamma_k$. To formalize the above discussion, we define
real-valued error functions $e_k(\gamma_0,\ldots,\gamma_{k-1})$ as follows:

$$e_k(\gamma_0,\ldots,\gamma_{k-1}) = (1 - \eta\gamma_{k-1})\,e_{k-1}(\gamma_0,\ldots,\gamma_{k-2}) + \gamma_{k-1}^2\nu^2 \quad \text{for } k \ge 1, \tag{9}$$

where $e_0$ is a positive scalar, $\eta$ is the strong convexity parameter and $\nu^2$ is the upper bound for the second moments
of the error norms $\|w_k\|$.

In what follows, we consider the sequence $\{\gamma_k^*\}$ given by

$$\gamma_0^* = \frac{\eta}{2\nu^2}\,e_0, \tag{10}$$
$$\gamma_k^* = \gamma_{k-1}^*\left(1 - \frac{\eta}{2}\gamma_{k-1}^*\right) \quad \text{for all } k \ge 1. \tag{11}$$

We often abbreviate $e_k(\gamma_0,\ldots,\gamma_{k-1})$ by $e_k$ whenever this is unambiguous. We show that the stepsizes $\gamma_i^*$, $i = 0,\ldots,k-1$, minimize the errors $e_k$ over $(0, \frac{1}{L}]^k$, where $L$ is the Lipschitz constant. In particular, we have the
following result.

Proposition 2: Let $e_k(\gamma_0,\ldots,\gamma_{k-1})$ be defined as in (9), where $e_0 > 0$ is such that $\frac{\eta}{2\nu^2}e_0 \le \frac{1}{L}$, with $L$ being
the Lipschitz constant for the gradients of $f$. Let the sequence $\{\gamma_k^*\}$ be given by (10)-(11). Then, the following
hold:

(a) The error $e_k$ satisfies

$$e_k(\gamma_0^*,\ldots,\gamma_{k-1}^*) = \frac{2\nu^2}{\eta}\gamma_k^* \quad \text{for all } k \ge 0.$$

(b) For each $k \ge 1$, the vector $(\gamma_0^*, \gamma_1^*, \ldots, \gamma_{k-1}^*)$ is the minimizer of the function $e_k(\gamma_0,\ldots,\gamma_{k-1})$ over the set

$$G_k = \left\{\alpha \in \mathbb{R}^k : 0 < \alpha_j \le \tfrac{1}{L} \text{ for } j = 1,\ldots,k\right\}.$$

More precisely, for any $k \ge 1$ and any $(\gamma_0,\ldots,\gamma_{k-1}) \in G_k$, we have

$$e_k(\gamma_0,\ldots,\gamma_{k-1}) - e_k(\gamma_0^*,\ldots,\gamma_{k-1}^*) \ge \nu^2(\gamma_{k-1} - \gamma_{k-1}^*)^2.$$

(c) The vector $\gamma^* = (\gamma_0^*, \gamma_1^*, \ldots, \gamma_{k-1}^*)$ is a stationary point of the function $e_k(\gamma_0, \gamma_1, \ldots, \gamma_{k-1})$ over the set $G_k$.

Proof: (a) We use induction on $k$ to prove our result. Note that the result holds trivially for $k = 0$ from
(10). Next, assume that we have $e_k(\gamma_0^*,\ldots,\gamma_{k-1}^*) = \frac{2\nu^2}{\eta}\gamma_k^*$ for some $k$ and consider the case for $k + 1$. By the
definition of the error $e_k$ in (9), we have

$$e_{k+1}(\gamma_0^*,\ldots,\gamma_k^*) = (1 - \eta\gamma_k^*)\,e_k(\gamma_0^*,\ldots,\gamma_{k-1}^*) + (\gamma_k^*)^2\nu^2 = (1 - \eta\gamma_k^*)\frac{2\nu^2}{\eta}\gamma_k^* + (\gamma_k^*)^2\nu^2,$$

where the second equality follows by the inductive hypothesis. Hence,

$$e_{k+1}(\gamma_0^*,\ldots,\gamma_k^*) = \frac{2\nu^2}{\eta}\gamma_k^*\left(1 - \eta\gamma_k^* + \frac{\eta}{2}\gamma_k^*\right) = \frac{2\nu^2}{\eta}\gamma_k^*\left(1 - \frac{\eta}{2}\gamma_k^*\right) = \frac{2\nu^2}{\eta}\gamma_{k+1}^*,$$

where the last equality follows by the definition of $\gamma_{k+1}^*$ in (11).

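(b) We now show that $(\gamma_0^*, \gamma_1^*, \ldots, \gamma_{k-1}^*)$ minimizes the error $e_k$ for all $k \ge 1$. We again use mathematical induction
on $k$. By the definition of the error $e_1$ and the relation $e_1(\gamma_0^*) = \frac{2\nu^2}{\eta}\gamma_1^*$ shown in part (a), we have

$$e_1(\gamma_0) - e_1(\gamma_0^*) = (1 - \eta\gamma_0)e_0 + \nu^2\gamma_0^2 - \frac{2\nu^2}{\eta}\gamma_1^*.$$

Using $\gamma_1^* = \gamma_0^*\left(1 - \frac{\eta}{2}\gamma_0^*\right)$, we obtain

$$e_1(\gamma_0) - e_1(\gamma_0^*) = (1 - \eta\gamma_0)e_0 + \nu^2\gamma_0^2 - \frac{2\nu^2}{\eta}\gamma_0^*\left(1 - \frac{\eta}{2}\gamma_0^*\right) = (1 - \eta\gamma_0)\frac{2\nu^2}{\eta}\gamma_0^* + \nu^2\gamma_0^2 - \frac{2\nu^2}{\eta}\gamma_0^*\left(1 - \frac{\eta}{2}\gamma_0^*\right),$$

where the last equality follows from $e_0 = \frac{2\nu^2}{\eta}\gamma_0^*$. Thus, we have

$$e_1(\gamma_0) - e_1(\gamma_0^*) = -2\nu^2\gamma_0\gamma_0^* + \nu^2\gamma_0^2 + \nu^2(\gamma_0^*)^2 = \nu^2(\gamma_0 - \gamma_0^*)^2.$$

Now suppose that $e_k(\gamma_0,\ldots,\gamma_{k-1}) \ge e_k(\gamma_0^*,\ldots,\gamma_{k-1}^*)$ holds for some $k$ and any $(\gamma_0,\ldots,\gamma_{k-1}) \in G_k$. We
want to show that $e_{k+1}(\gamma_0,\ldots,\gamma_k) \ge e_{k+1}(\gamma_0^*,\ldots,\gamma_k^*)$ holds as well for all $(\gamma_0,\ldots,\gamma_k) \in G_{k+1}$. To simplify
the notation, we use $e_{k+1}^*$ to denote the error $e_{k+1}$ evaluated at $(\gamma_0^*, \gamma_1^*, \ldots, \gamma_k^*)$, and $e_{k+1}$ when evaluating at an
arbitrary vector $(\gamma_0, \gamma_1, \ldots, \gamma_k) \in G_{k+1}$. Using (9) and part (a), we have

$$e_{k+1} - e_{k+1}^* = (1 - \eta\gamma_k)e_k + \nu^2\gamma_k^2 - \frac{2\nu^2}{\eta}\gamma_{k+1}^*.$$

Under the inductive hypothesis, we have $e_k \ge e_k^*$. Using this, the relation $e_k^* = \frac{2\nu^2}{\eta}\gamma_k^*$ of part (a) and the definition
of $\gamma_{k+1}^*$, we obtain

$$e_{k+1} - e_{k+1}^* \ge (1 - \eta\gamma_k)\frac{2\nu^2}{\eta}\gamma_k^* + \nu^2\gamma_k^2 - \frac{2\nu^2}{\eta}\gamma_k^*\left(1 - \frac{\eta}{2}\gamma_k^*\right) = \nu^2(\gamma_k - \gamma_k^*)^2.$$

Hence, we have $e_k(\gamma_0,\ldots,\gamma_{k-1}) - e_k(\gamma_0^*,\ldots,\gamma_{k-1}^*) \ge \nu^2(\gamma_{k-1} - \gamma_{k-1}^*)^2$ for all $k \ge 1$ and all $(\gamma_0,\ldots,\gamma_{k-1}) \in G_k$.
Therefore, for all $k \ge 1$, the vector $(\gamma_0^*,\ldots,\gamma_{k-1}^*) \in G_k$ is a minimizer of the error $e_k$.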

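(c) By the choice of $e_0$, we have $0 < \gamma_0^* \le \frac{1}{L}$. Observe that since $\eta \le L$, it follows that $0 < \gamma_1^* < \gamma_0^*$, and by
induction we can see that $0 < \gamma_k^* \le \frac{1}{L}$ for all $k \ge 1$. Thus, $(\gamma_0^*,\ldots,\gamma_{k-1}^*) \in G_k$ for all $k \ge 1$.
Now, we proceed by induction on $k$. For $k = 1$, we have

$$\frac{\partial e_1}{\partial\gamma_0} = -\eta e_0 + 2\gamma_0\nu^2.$$

Thus, the derivative of $e_1$ vanishes at $\gamma_0^* = \frac{\eta}{2\nu^2}e_0$, which satisfies $0 < \gamma_0^* \le \frac{1}{L}$ by the choice of $e_0$. Furthermore,
note that the function $e_1(\gamma_0)$ is convex in $\gamma_0$. Hence, $\gamma_0^* = \frac{\eta}{2\nu^2}e_0$ is the stationary point of $e_1$ over the entire real
line. Suppose now that for $k \ge 1$, the vector $(\gamma_0^*,\ldots,\gamma_{k-1}^*)$ is the minimizer of $e_k$ over the set $G_k$. Let us now
consider the case of $k + 1$. The partial derivative of $e_{k+1}$ with respect to $\gamma_\ell$ is given by

$$\frac{\partial e_{k+1}}{\partial\gamma_\ell} = -\eta e_0\prod_{i=0,\,i\ne\ell}^{k}(1 - \eta\gamma_i) - \eta\nu^2\sum_{i=0}^{\ell-1}\left(\gamma_i^2\prod_{j=i+1,\,j\ne\ell}^{k}(1 - \eta\gamma_j)\right) + 2\nu^2\gamma_\ell\prod_{i=\ell+1}^{k}(1 - \eta\gamma_i),$$

where $0 \le \ell \le k - 1$. By factoring out the common term $\prod_{i=\ell+1}^{k}(1 - \eta\gamma_i)$, we obtain

$$\frac{\partial e_{k+1}}{\partial\gamma_\ell} = \left[-\eta\left(e_0\prod_{i=0}^{\ell-1}(1 - \eta\gamma_i) + \nu^2\sum_{i=0}^{\ell-1}\left(\gamma_i^2\prod_{j=i+1}^{\ell-1}(1 - \eta\gamma_j)\right)\right) + 2\nu^2\gamma_\ell\right]\prod_{i=\ell+1}^{k}(1 - \eta\gamma_i), \tag{12}$$

with the convention that an empty product equals 1. From the definition of $e_k$ in (9) we can see that

$$e_{k+1}(\gamma_0,\ldots,\gamma_k) = e_0\prod_{i=0}^{k}(1 - \eta\gamma_i) + \nu^2\sum_{i=0}^{k-1}\left(\gamma_i^2\prod_{j=i+1}^{k}(1 - \eta\gamma_j)\right) + \nu^2\gamma_k^2. \tag{13}$$

By combining relations (12) and (13), we obtain for all $\ell = 0,\ldots,k-1$,

$$\frac{\partial e_{k+1}}{\partial\gamma_\ell} = \left(-\eta\,e_\ell(\gamma_0,\ldots,\gamma_{\ell-1}) + 2\nu^2\gamma_\ell\right)\prod_{i=\ell+1}^{k}(1 - \eta\gamma_i),$$

where for $\ell = 0$, we have $e_\ell(\gamma_0,\ldots,\gamma_{\ell-1}) = e_0$. By part (a), there holds $-\eta\,e_\ell(\gamma_0^*,\ldots,\gamma_{\ell-1}^*) + 2\nu^2\gamma_\ell^* = 0$, thus
showing that $\frac{\partial e_{k+1}}{\partial\gamma_\ell}$ vanishes at $(\gamma_0^*,\ldots,\gamma_k^*) \in G_{k+1}$ for all $\ell = 0,\ldots,k-1$.

Finally, we consider the partial derivative of $e_{k+1}$ with respect to $\gamma_k$, for which we have

$$\frac{\partial e_{k+1}}{\partial\gamma_k} = -\eta\left(e_0\prod_{i=0}^{k-1}(1 - \eta\gamma_i) + \nu^2\sum_{i=0}^{k-2}\left(\gamma_i^2\prod_{j=i+1}^{k-1}(1 - \eta\gamma_j)\right) + \nu^2\gamma_{k-1}^2\right) + 2\nu^2\gamma_k.$$

Using relation (13), we obtain

$$\frac{\partial e_{k+1}}{\partial\gamma_k} = -\eta\,e_k(\gamma_0,\ldots,\gamma_{k-1}) + 2\nu^2\gamma_k.$$

By part (a), we have $-\eta\,e_k(\gamma_0^*,\ldots,\gamma_{k-1}^*) + 2\nu^2\gamma_k^* = 0$, thus showing that $\frac{\partial e_{k+1}}{\partial\gamma_k}$ vanishes at $(\gamma_0^*,\ldots,\gamma_k^*) \in G_{k+1}$.

Thus, by induction we have that $(\gamma_0^*,\ldots,\gamma_k^*)$ is a stationary point of $e_{k+1}$ in the set $G_{k+1}$. ■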
We observe that in Proposition 2, the minimizer $(\gamma_0^*,\ldots,\gamma_{k-1}^*)$ of the function $e_k$ over the set $G_k$ is unique up to
scaling by a factor $\beta < 1$. Specifically, the solution $(\gamma_0^*,\ldots,\gamma_{k-1}^*)$ is obtained for an initial error $e_0 > 0$ satisfying
$\frac{\eta}{2\nu^2}e_0 \le \frac{1}{L}$. Suppose that in the definition of the sequence $\{\gamma_k^*\}$ instead of $e_0$ we use $\beta e_0$ for some $\beta \in (0, 1)$. Then
it can be seen (by following the proof) that, for the resulting sequence, Proposition 2 would still hold.
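
The recursion (10)-(11) and the claims of Proposition 2 can be checked numerically. The following is a minimal sketch, not part of the original development: the function names and the parameter values ($\eta = 1$, $L = 4$, $\nu = 1$, $e_0 = 0.4$) are assumptions chosen only so that $\frac{\eta}{2\nu^2}e_0 \le \frac{1}{L}$ holds. It evaluates the error recursion (9) at the recursive steps and at a perturbed feasible sequence.

```python
def rsa_steps(e0, eta, nu, n):
    """Return [gamma_0^*, ..., gamma_n^*] generated by (10)-(11)."""
    gammas = [eta * e0 / (2.0 * nu**2)]
    for _ in range(n):
        g = gammas[-1]
        gammas.append(g * (1.0 - 0.5 * eta * g))
    return gammas


def error(steps, e0, eta, nu):
    """Evaluate e_k(gamma_0, ..., gamma_{k-1}) through recursion (9)."""
    e = e0
    for g in steps:
        e = (1.0 - eta * g) * e + g**2 * nu**2
    return e


# Illustrative (assumed) problem parameters: eta*e0/(2*nu^2) = 0.2 <= 1/L = 0.25.
eta, L, nu, e0 = 1.0, 4.0, 1.0, 0.4
k = 20
star = rsa_steps(e0, eta, nu, k)          # gamma_0^*, ..., gamma_k^*
e_star = error(star[:k], e0, eta, nu)     # e_k at the recursive steps

# Part (a): e_k at the recursive steps should equal (2 nu^2 / eta) * gamma_k^*.
gap_a = abs(e_star - 2.0 * nu**2 / eta * star[k])

# Part (b): any other sequence in G_k should be worse by at least
# nu^2 * (gamma_{k-1} - gamma_{k-1}^*)^2; here we perturb every step by 10%.
other = [min(1.0 / L, 1.1 * g) for g in star[:k]]
margin = error(other, e0, eta, nu) - e_star - nu**2 * (other[-1] - star[k - 1])**2

print(gap_a, margin)
```

Under these assumptions, `gap_a` should be at the level of rounding error and `margin` should be nonnegative, in line with parts (a) and (b).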



C. Convergence theory 

We next show that the proposed RSA approximation scheme discussed in Section III-B leads to a convergent 
algorithm. We prove this in a more general setting for a stepsize with a form similar to that seen in constructing the 
optimal choice. The following proposition holds for any stepsize of a form similar to the optimal scheme of (11). 

Proposition 3 (Global convergence of RSA scheme): Let Assumptions 1 and 3 hold. Let the function $f$ be differentiable over the set $X$ with Lipschitz gradients and the optimal solution set of problem (1) be nonempty. Assume
that the stepsize sequence $\{\gamma_k\}$ is generated by the following self-adaptive scheme:

$$\gamma_k = \gamma_{k-1}(1 - c\gamma_{k-1}) \quad \text{for all } k \ge 1, \tag{14}$$

where $c > 0$ is a scalar and the initial stepsize is such that $0 < \gamma_0 < \frac{1}{c}$. Then, the sequence $\{x_k\}$ generated by
algorithm (3) converges almost surely to a random point that belongs to the optimal set.

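Proof: We employ Proposition 1. To apply this proposition, it suffices to verify that Assumption 2 holds. First
we show that $\sum_{i=0}^{\infty}\gamma_i = \infty$. From (14) we obtain

$$\prod_{\ell=1}^{k+1}\gamma_\ell = \left(\prod_{i=0}^{k}\gamma_i\right)\prod_{i=0}^{k}(1 - c\gamma_i).$$

By dividing both sides by $\prod_{i=1}^{k}\gamma_i$, it follows that

$$\gamma_{k+1} = \gamma_0\prod_{i=0}^{k}(1 - c\gamma_i). \tag{15}$$

Since $\gamma_0 \in (0, \frac{1}{c})$, from (14) it follows that $\{\gamma_k\}$ is a positive nonincreasing sequence. Therefore, the limit $\lim_{k\to\infty}\gamma_k$
exists and it is less than $\frac{1}{c}$. Thus, by taking the limit in (14), we obtain $\lim_{k\to\infty}\gamma_k = 0$. Then, by taking limits
in (15), we further obtain

$$\lim_{k\to\infty}\prod_{i=0}^{k}(1 - c\gamma_i) = 0.$$

To arrive at a contradiction, suppose that $\sum_{i=0}^{\infty}\gamma_i < \infty$. Then, there is an $\epsilon \in (0, 1)$ such that for $j$ sufficiently
large, we have

$$c\sum_{i=j}^{k}\gamma_i \le \epsilon \quad \text{for all } k \ge j.$$

Since $\prod_{i=j}^{k}(1 - c\gamma_i) \ge 1 - c\sum_{i=j}^{k}\gamma_i$ for all $j \le k$, by letting $k \to \infty$, we obtain for all $j$ sufficiently large,

$$\prod_{i=j}^{\infty}(1 - c\gamma_i) \ge 1 - c\sum_{i=j}^{\infty}\gamma_i \ge 1 - \epsilon > 0.$$

This, however, contradicts the fact $\lim_{k\to\infty}\prod_{i=0}^{k}(1 - c\gamma_i) = 0$. Therefore, we conclude that $\sum_{i=0}^{\infty}\gamma_i = \infty$.
Now we show that $\sum_{i=0}^{\infty}\gamma_i^2 < \infty$. From (14) we have

$$\gamma_k = \gamma_{k-1} - c\gamma_{k-1}^2 \quad \text{for all } k \ge 1.$$

Summing the preceding relations, we obtain

$$\gamma_k = \gamma_0 - c\sum_{i=0}^{k-1}\gamma_i^2 \quad \text{for all } k \ge 1.$$

By taking limits and recalling that $\lim_{k\to\infty}\gamma_k = 0$, we obtain

$$\sum_{i=0}^{\infty}\gamma_i^2 = \frac{\gamma_0}{c} < \infty.$$

Assumption 3 and relation $\sum_{i=0}^{\infty}\gamma_i^2 < \infty$ yield $\sum_{k=0}^{\infty}\gamma_k^2\,\mathsf{E}[\|w_k\|^2 \mid \mathcal{F}_k] < \infty$. Hence, Assumption 2 holds. ■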
Note that Proposition 3 applies to algorithm (3) with the stepsize sequence $\{\gamma_k^*\}$ generated by the recursive
scheme (11). Thus, we immediately have the following corollary.

Corollary 1 (Convergence of RSA scheme): Let Assumptions 1 and 3 hold. Let the function $f$ be differentiable
over the set $X$ with Lipschitz gradients with constant $L > 0$ and strongly convex with parameter $\eta > 0$. Let the stepsize sequence $\{\gamma_k^*\}$ be generated by the recursive scheme (11) with $e_0 = \mathsf{E}[\|x_0 - x^*\|^2]$. If $\frac{\eta}{2\nu^2}\mathsf{E}[\|x_0 - x^*\|^2] \le \frac{1}{L}$,
then the sequence $\{x_k\}$ generated by algorithm (3) converges almost surely to the unique optimal solution $x^*$ of
problem (1).

Proof: The existence and uniqueness of the optimal solution follows by the strong convexity assumption. 
Almost sure convergence follows by Proposition 3. ■ 
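
The two summability properties used in the proof of Proposition 3 can be observed numerically. The sketch below is illustrative only: the values $\gamma_0 = 0.5$ and $c = 1$ are assumptions satisfying $0 < \gamma_0 < \frac{1}{c}$, and the function name is hypothetical. It generates the sequence (14) and compares the partial sum of squares with the telescoped identity $\gamma_k = \gamma_0 - c\sum_{i=0}^{k-1}\gamma_i^2$.

```python
import math


def self_adaptive_steps(gamma0, c, n):
    """Return [gamma_0, ..., gamma_n] generated by recursion (14)."""
    gammas = [gamma0]
    for _ in range(n):
        g = gammas[-1]
        gammas.append(g * (1.0 - c * g))
    return gammas


# Assumed illustrative values with 0 < gamma0 < 1/c.
gamma0, c, n = 0.5, 1.0, 100000
g = self_adaptive_steps(gamma0, c, n)

sum_sq = math.fsum(x * x for x in g[:n])   # sum_{i=0}^{n-1} gamma_i^2
telescoped = (gamma0 - g[n]) / c           # from gamma_n = gamma_0 - c * sum_sq
partial_sum = math.fsum(g[:n])             # sum_{i=0}^{n-1} gamma_i, keeps growing with n

print(sum_sq, telescoped, partial_sum)
```

The sum of squares stays below $\gamma_0 / c$, while the plain partial sum keeps increasing as $n$ grows, which is the behavior Assumption 2 requires.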
Note that when the set $X$ is bounded, in Proposition 1 we may use $e_0 = \max_{x,y\in X}\|x - y\|^2$ and the results
will hold as long as $\frac{\eta}{2\nu^2}\max_{x,y\in X}\|x - y\|^2 \le \frac{1}{L}$.

In the following, we discuss a recursive stepsize for algorithm (3) as applied to a nonsmooth but strongly convex
function $f(x) = \mathsf{E}[F(x,\xi)]$. Let $G(x,\xi)$ be a subgradient vector of $F(x,\xi)$ with respect to $x$, i.e., $G(x,\xi) \in \partial_x F(x,\xi)$.
Assume that there is a positive number $M$ such that

$$\mathsf{E}[\|G(x,\xi)\|^2] \le M^2 \quad \text{for all } x \in X.$$

We have the following convergence result, which obviously also holds for smooth problems.

Proposition 4 (Convergence of RSA with a nonsmooth objective): Consider problem (1) and let Assumption 1
hold. Also, let the set $X$ be compact and the function $f$ be strongly convex over $X$ with constant $\eta$. Assume that
there is a scalar $M > 0$ such that $\mathsf{E}[\|G(x,\xi)\|^2] \le M^2$ for all $x \in X$. Consider the following algorithm:

$$x_{k+1} = \Pi_X\left(x_k - \gamma_k G(x_k,\xi_k)\right), \tag{16}$$

where $x_0 \in X$ is a random initial point independent of $\{\xi_k\}$ and $\gamma_k$ is a (deterministic) stepsize. Consider the
self-adaptive stepsize sequence $\{\gamma_k\}$ defined by

Ik = 7fe-i(l - V7k-i) for all k>l, 
where D = max^^^^^x — ?/||. Assuming that < ^, we have 

E[||xfe-x*||2] <^7l forallfc>l. 

Proof: The proof is based on verifying that, for the algorithm in (16), Proposition 2 holds, where 2z^^ is 
replaced by and eo = D'^. Then, the rest of the proof is similar to that of Proposition 1. ■ 

IV. A CASCADING STEPLENGTH STOCHASTIC APPROXIMATION SCHEME 

In Section III, we presented a stochastic approximation scheme in which the sequence of steplengths is determined 
via a recursion that relies on optimizing the error estimates. A key benefit of such a recursion is that the steplength 
choice is not left to the user. In this section, we introduce an alternate avenue for specifying steplengths that also 
considers a diminishing steplength framework but uses a markedly different approach for determining the steplength. 
In particular, the scheme relies on reducing the steplength at a set of epochs while the steplengths are maintained as 
constant between these epochs. The details of this stochastic approximation scheme (called the cascading steplength 
stochastic approximation (CSA) scheme) are presented in Section IV-A while convergence theory is provided in 
Section IV-B. 



A. A cascading steplength scheme 

Our technique is based on the properties derived from problems possessing strongly convex objectives. Specifically, we obtain the following result from the inequality in Lemma 4 when the stepsize is maintained as constant.

Proposition 5: Let Assumptions 1 and 3 hold. Also, let $f$ be differentiable over the set $X$ with Lipschitz gradients
with constant $L > 0$ and strongly convex with constant $\eta > 0$. Let the sequence $\{x_k\}$ be generated by (3) with
constant stepsize $\gamma_k = \gamma$ for all $k \ge 0$, where $\gamma \in (0, \frac{2}{L})$. Then, we have

$$\mathsf{E}[\|x_{k+1} - x^*\|^2] \le q(\gamma)\,\mathsf{E}[\|x_k - x^*\|^2] + \gamma^2\nu^2 \quad \text{for all } k \ge 0, \tag{17}$$

where $q(\gamma) = 1 - \eta\gamma(2 - \gamma L)$ and $x^*$ is the optimal solution of problem (1).

Proof: Follows from the inequality in part (b) of Lemma 4. ■

From inequality (17), we obtain the following relation

$$\mathsf{E}[\|x_k - x^*\|^2] \le \underbrace{q(\gamma)^k\,\mathsf{E}[\|x_0 - x^*\|^2]}_{\text{Transient error}} + \underbrace{\frac{\gamma^2\nu^2}{1 - q(\gamma)}}_{\text{Persistent error}} \quad \text{for all } k \ge 1, \tag{18}$$

where the expected distance $\mathsf{E}[\|x_k - x^*\|^2]$ is bounded by the sum of two error terms:

(1) Transient error: The transient error, given by $q(\gamma)^k\,\mathsf{E}[\|x_0 - x^*\|^2]$, decays to zero as $k \to \infty$. In effect, the
contractive nature of this error, as arising from $q(\gamma) < 1$, ensures that the transient error can be reduced to
an arbitrarily small level.

(2) Persistent error: The persistent error, given by $\frac{\gamma^2\nu^2}{1 - q(\gamma)}$, is invariant to increasing the number of iterations,
denoted by $k$. Its reduction, as we proceed to show, necessitates reducing $\gamma$.
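
The split into transient and persistent terms can be made concrete by iterating the worst case of the constant-stepsize recursion, $e_{k+1} = q(\gamma)e_k + \gamma^2\nu^2$, and comparing it with the bound (18). The snippet below is a minimal sketch; all numerical values ($\eta = 1$, $L = 4$, $\nu = 1$, $\gamma = 0.1$, $e_0 = 25$) are assumptions made for illustration.

```python
def q(gamma, eta, L):
    """Contraction factor q(gamma) = 1 - eta*gamma*(2 - gamma*L)."""
    return 1.0 - eta * gamma * (2.0 - gamma * L)


# Assumed illustrative values; gamma lies in (0, 2/L).
eta, L, nu, gamma, e0 = 1.0, 4.0, 1.0, 0.1, 25.0
qg = q(gamma, eta, L)
persistent = gamma**2 * nu**2 / (1.0 - qg)

e = e0
history = []
for k in range(1, 201):
    e = qg * e + gamma**2 * nu**2     # worst case of the recursion (equality)
    transient = qg**k * e0            # first term of (18)
    history.append((k, e, transient, persistent))

for k, err, tr, pe in history[::40]:
    print(k, err, tr, pe)
```

In this run the transient term shrinks geometrically while the error levels off at the persistent term, which is why a further reduction of the error requires a smaller $\gamma$.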

Our cascading steplength scheme basically requires specifying a rule for deciding at what iteration to decrease
the steplength and to what extent it should be decreased. The iterations during which the stepsize is kept fixed are
referred to as a constant steplength regime or just a regime. Given the two error terms, our scheme can be loosely
represented as an infinite sequence of regimes of finite duration. In fact, we proceed to show that the duration
of the regimes is an increasing function. Entering a new regime is marked by a reduction in the steplength. In
fact, since a finite reduction in the steplength occurs between consecutive regimes, the steplength sequence would
naturally converge to zero if there is an infinite number of regimes. Suppose one is at the beginning of the $t$th
regime, where the steplength is $\gamma_t$ and the current iteration number is $K$. The steplength $\gamma_t$ is maintained constant
during regime $t$. Furthermore, suppose that at the beginning of the $t$th regime, the transient error is greater than
the persistent error for $\gamma_t$, i.e., $\mathsf{E}[\|x_K - x^*\|^2] > \frac{\gamma_t^2\nu^2}{1 - q(\gamma_t)}$. Since $0 < q(\gamma_t) < 1$, $\mathsf{E}[\|x_K - x^*\|^2]$ decreases when
multiplied with $q(\gamma_t)^k$, $k > 0$. The larger $k$, the smaller $q(\gamma_t)^k\,\mathsf{E}[\|x_K - x^*\|^2]$, so there exists $k > 0$ for which
$q(\gamma_t)^k\,\mathsf{E}[\|x_K - x^*\|^2]$ will drop and remain below the persistent error $\frac{\gamma_t^2\nu^2}{1 - q(\gamma_t)}$. We let $K_t$ be the index $k$ just before
this drop takes place, i.e., $K_t$ is the largest $k$ for which the following inequality holds:

$$q(\gamma_t)^k\,\mathsf{E}[\|x_K - x^*\|^2] > \frac{\gamma_t^2\nu^2}{1 - q(\gamma_t)}.$$

Therefore, $K_t$ specifies the duration of regime $t$, during which the stepsize is fixed at $\gamma_t$.

The next question is how one should go about reducing the persistent error. We observe through the next result
that by reducing $\gamma_t$, the persistent error does indeed reduce.

Lemma 5: Consider the persistent error given by $P(\gamma) = \frac{\gamma^2\nu^2}{1 - q(\gamma)}$, where $q(\gamma) = 1 - \eta\gamma(2 - \gamma L)$ and $\gamma \in (0, \frac{2}{L})$.
Then, this error is an increasing function of $\gamma$.

Proof: By using $q(\gamma) = 1 - \eta\gamma(2 - \gamma L)$, for the persistent error we obtain $P(\gamma) = \frac{\gamma\nu^2}{\eta(2 - \gamma L)}$. Therefore, the
derivative of the persistent error with respect to $\gamma$ is given by $P'(\gamma) = \frac{2\nu^2}{\eta(2 - \gamma L)^2} > 0$ for all $\gamma \ne \frac{2}{L}$. ■

Therefore, when $\gamma_t$ is reduced to $\gamma_{t+1}$, the persistent error does indeed reduce. This drop in steplength is referred
to as the cascading step and marks the commencement of a new regime. As earlier, in this regime, the persistent
error will be smaller than the transient error and the process of determining $K_{t+1}$ can be repeated. Therefore, we
may view the scheme as a diminishing steplength scheme where the steplength is reduced at a sequence of time
epochs and between these epochs, it is maintained constant.

We now proceed to describe the scheme more formally. It can be viewed as having two stages, of which the
second stage repeats infinitely often in a consecutive fashion. The first of these is an initialization phase. We assume
throughout that the constraint set $X$ is bounded, so that $\mathsf{E}[\|x_0 - x^*\|^2] \le D^2$ with $D = \max_{x,y\in X}\|x - y\|$. Next,
we describe each of the stages in the cascading scheme in some detail.
Cascading steplength stochastic approximation (CSA) scheme:
Initialization phase (Phase I): A requirement to begin making gradient steps is that the persistent error has to be
smaller than $D^2$. If this were not the case, then $\gamma$ would have to be reduced until the persistent error is smaller
than $D^2$. More specifically, given a parameter $\theta \in (0, 1)$, we determine the integer $\ell$ such that

$$D^2 > \frac{(\gamma\theta^\ell)^2\nu^2}{1 - q(\gamma\theta^\ell)},$$

where $q(\gamma) = 1 - \eta\gamma(2 - L\gamma)$ and $0 < \gamma < \frac{2}{L}$. We define $\gamma_0$ as $\gamma_0 = \gamma\theta^\ell$, $q_0 = q(\gamma_0)$, and

$$K_0 = \max\left\{k \in \mathbb{Z}_+ : q_0^k D^2 > \frac{\gamma_0^2\nu^2}{1 - q_0}\right\}. \tag{20}$$

Finally, we exit this phase by defining $\bar{K}_{-1} = 0$, setting $t = 0$, and going to Phase II$_t$.


Fig. 1: The cascading scheme with phases II$_0$ (left), II$_1$ (center) and II$_2$ (right).



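Constant steplength phase (Phase II$_t$): Define $\bar{K}_t = \sum_{j=0}^{t} K_j$. For iteration indices $k$ with $k \in \{\bar{K}_{t-1} + 1, \ldots, \bar{K}_t\}$, the stepsize is kept constant and equal to $\gamma_t$, i.e.,

$$\gamma_k = \gamma_t \quad \text{for all } k = \bar{K}_{t-1} + 1, \ldots, \bar{K}_t.$$

Then, we increase $t$ by setting $t = t + 1$, reduce the stepsize by letting $\gamma_t = \gamma_{t-1}\theta$, compute $q_t = q(\gamma_t)$ and
determine the integer $K_t$ as follows:

$$K_t = \max\left\{k \in \mathbb{Z}_+ : q_t^k\,2^t\left(\prod_{j=0}^{t-1} q_j^{K_j}\right)D^2 > \frac{\gamma_t^2\nu^2}{1 - q_t}\right\}. \tag{21}$$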



We then repeat Phase II$_t$ until the number $k$ of iterations (i.e., gradient steps) exceeds a pre-specified threshold, in
which case the algorithm terminates.

We provide a graphical representation of these phases in Figure 1, where the circles around $x^*$ represent thresholds
beyond which the transient error is less than the persistent error. For instance, in Figure 1 (plot to the left), phase II$_0$
requires $K_0$ steps to reach the first circle. Once the steplength is reduced by a factor $\theta$, the phase II$_1$ commences
and requires $K_1$ steps to reach an analogous error threshold where the transient error is equal to the persistent
error; this is illustrated in Figure 1 (plot in the center). Finally, phase II$_2$ requires $K_2$ steps to reach an even smaller level
of persistent error, as depicted in Figure 1 (plot to the right). Note that whenever the steplength is reduced, the
persistent error is immediately reduced (Lemma 5). Thus, the stepsize is essentially a piecewise constant decreasing
function of the iteration index $k$.
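
The two phases can be sketched in a few lines of code. The following is an illustrative implementation under stated assumptions, not the authors' code: the function names and the test parameters are hypothetical, the regime lengths follow our reading of (20) and (21) (largest $k$ with $q_t^k\,2^t\prod_{j<t}q_j^{K_j}D^2$ above the persistent error), and the products are tracked in the log domain to avoid underflow. It assumes $0 < \gamma < \frac{2}{L}$ and $\eta < L$ so that $q(\gamma) \in (0, 1)$.

```python
import math


def q(gamma, eta, L):
    """Contraction factor q(gamma) = 1 - eta*gamma*(2 - L*gamma)."""
    return 1.0 - eta * gamma * (2.0 - L * gamma)


def persistent(gamma, eta, L, nu):
    """Persistent error gamma^2 nu^2 / (1 - q(gamma))."""
    return gamma**2 * nu**2 / (1.0 - q(gamma, eta, L))


def csa_schedule(gamma, theta, eta, L, nu, D, num_regimes):
    """Return [(gamma_t, K_t)] for t = 0, ..., num_regimes - 1."""
    # Phase I: shrink gamma until the persistent error is below D^2.
    while persistent(gamma, eta, L, nu) >= D**2:
        gamma *= theta
    schedule = []
    # log of 2^t * prod_{j<t} q_j^{K_j} * D^2, starting at t = 0.
    log_pref = 2.0 * math.log(D)
    for _ in range(num_regimes):
        log_q = math.log(q(gamma, eta, L))
        log_p = math.log(persistent(gamma, eta, L, nu))
        # Largest k >= 0 with k*log_q + log_pref > log_p (log_q < 0).
        k = max(math.ceil((log_pref - log_p) / (-log_q)) - 1, 0)
        while k > 0 and k * log_q + log_pref <= log_p:
            k -= 1                       # guard against rounding
        while (k + 1) * log_q + log_pref > log_p:
            k += 1                       # guard against rounding
        schedule.append((gamma, k))
        log_pref += k * log_q + math.log(2.0)   # move to regime t + 1
        gamma *= theta                           # cascading step
    return schedule


# Assumed illustrative parameters.
sched = csa_schedule(gamma=0.4, theta=0.5, eta=1.0, L=4.0, nu=1.0, D=10.0, num_regimes=5)
for t, (g, K) in enumerate(sched):
    print(t, g, K)
```

Each tuple gives the constant stepsize of a regime and the number of iterations for which it is held; a full method would run algorithm (3) with $\gamma_t$ for $K_t$ iterations before moving to the next regime.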

The next result establishes the correctness of the cascading scheme by showing that $K_t$ in Phase II$_t$ is finite, so
the scheme is well defined.

Proposition 6: Let Assumptions 1 and 3 hold. Also, let $f$ be differentiable over the set $X$ with Lipschitz gradients
with constant $L > 0$ and strongly convex with constant $\eta > 0$. Assume that the set $X$ is compact and let
$D = \max_{x,y\in X}\|x - y\|$. Then, $K_t$ is finite for all $t \ge 0$.

Proof: We use induction on $t$ to show that $K_t$ is well defined and for all $t \ge 0$,

$$\mathsf{E}[\|x_{\bar{K}_t} - x^*\|^2] \le 2^{t+1}\left(\prod_{j=0}^{t} q_j^{K_j}\right)D^2. \tag{22}$$

First note that, since $\gamma_0 \in (0, \frac{2}{L})$ and the steplength is non-increasing in $k$, we have $q(\gamma_t) \in (0, 1)$ for all $t \ge 0$.

For $t = 0$, from Proposition 5 and the boundedness of the set $X$ we have

$$\mathsf{E}[\|x_k - x^*\|^2] \le q_0^k D^2 + \frac{\gamma_0^2\nu^2}{1 - q_0} \quad \text{for all } k \ge 0, \tag{23}$$

where $q_0 = q(\gamma_0) = 1 - \eta\gamma_0(2 - \gamma_0 L)$ and $\gamma_0$ is as given in the initialization phase of the cascading scheme. Since
$\gamma_0$ is selected in the initialization phase so that $D^2 > \frac{\gamma_0^2\nu^2}{1 - q_0}$ and $q_0^k D^2$ is decreasing as $k$ increases, there exists
an integer $K \ge 1$ such that $q_0^K D^2 \le \frac{\gamma_0^2\nu^2}{1 - q_0}$. Note that $K_0 = K - 1$, thus $K_0$ is well defined. Furthermore, since
$q_0^k D^2 > \frac{\gamma_0^2\nu^2}{1 - q_0}$ for $k = 0, \ldots, K_0$, from (23) we have

$$\mathsf{E}[\|x_{\bar{K}_0} - x^*\|^2] \le 2 q_0^{K_0} D^2,$$

where we use the fact $\bar{K}_0 = K_0$ (see Phase II$_t$ for $t = 0$).



Fig. 2: Elements of cascading scheme for the stochastic utility problem. Panels: (a) Transient vs. Persistent (transient term and persistent term); (b) Total. Horizontal axis: Number of iterations (N).



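Now assume that $K_t$ is well defined and relation (22) holds for $t$. We next show that $K_{t+1}$ is also well defined
and relation (22) holds for $t + 1$. Note that the steplength $\gamma_k = \gamma_{t+1}$ is used for $k > \bar{K}_t$. From Proposition 5 where
we replace $x_0$ with $x_{\bar{K}_t}$, by replacing $\gamma$ by $\gamma_{t+1}$ and letting $q_{t+1} = q(\gamma_{t+1})$, we have for $k > \bar{K}_t$,

$$\mathsf{E}[\|x_k - x^*\|^2] \le q_{t+1}^{k - \bar{K}_t}\,\mathsf{E}[\|x_{\bar{K}_t} - x^*\|^2] + \frac{\gamma_{t+1}^2\nu^2}{1 - q_{t+1}}.$$

By the inductive hypothesis relation (22) holds, so it follows that

$$\mathsf{E}[\|x_k - x^*\|^2] \le \underbrace{q_{t+1}^{k - \bar{K}_t}\,2^{t+1}\left(\prod_{j=0}^{t} q_j^{K_j}\right)D^2}_{\text{Term 1}} + \underbrace{\frac{\gamma_{t+1}^2\nu^2}{1 - q_{t+1}}}_{\text{Term 2}} \quad \text{for all } k > \bar{K}_t. \tag{24}$$

Consequently, $K_{t+1}$ is defined as the largest positive integer $k$ for which term 1 is strictly greater than term 2, i.e.,

$$K_{t+1} = \max\left\{k \in \mathbb{Z}_+ : q_{t+1}^k\,2^{t+1}\left(\prod_{j=0}^{t} q_j^{K_j}\right)D^2 > \frac{\gamma_{t+1}^2\nu^2}{1 - q_{t+1}}\right\}$$

(see the definition of $K_t$ in (21)). Noting that $\bar{K}_{t+1} = \bar{K}_t + K_{t+1}$ (see Phase II$_t$) and $q_{t+1}^{k - \bar{K}_t}\,2^{t+1}\left(\prod_{j=0}^{t} q_j^{K_j}\right)D^2 > \frac{\gamma_{t+1}^2\nu^2}{1 - q_{t+1}}$ for $k = \bar{K}_t + 1, \ldots, \bar{K}_{t+1}$, from (24) with $k = \bar{K}_{t+1}$, we obtain

$$\mathsf{E}[\|x_{\bar{K}_{t+1}} - x^*\|^2] \le 2^{t+2}\left(\prod_{j=0}^{t+1} q_j^{K_j}\right)D^2,$$

thus showing relation (22) for $t + 1$ and completing the proof. ■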
The transient and persistent error trajectories are illustrated in Figure 2 for a problem discussed later in Section
VI-A1. In Figure 2a, the transient and persistent terms of the error are plotted. The persistent error, as expected, is a
piecewise constant decreasing function of the iteration count with the jumps occurring whenever the steplengths are
reduced. The transient error is a plot of $q_t^{k - \bar{K}_{t-1}}\,2^t\prod_{j=0}^{t-1} q_j^{K_j}D^2$ with respect to $k$. This function is a decreasing function
when $k \in \{\bar{K}_{t-1}, \ldots, \bar{K}_t - 1\}$. As soon as $k = \bar{K}_t$, in the transient error the factor $2^t$ is replaced with $2^{t+1}$,
leading to the increase in transient error at that juncture. The total error, which is the summation of the two terms, is
shown in Figure 2b.

Remark on choice of $\theta$: Recall that $\theta$ specifies the rate at which the steplength is dropped over consecutive steps
in the cascading scheme. It can be readily observed from the bounds derived on $K_t$ that if $\theta \to 1$, then $K_t$ becomes small,
thus implying that the steplength is kept constant for a very short period. This is intuitive since a conservative drop
in steplengths would imply that these drops have to occur more frequently to ensure that the sequence is driven to
zero. Conversely, if $\theta \to 0$, then $K_t$ can grow to be quite large.

B. Global convergence theory 

In this section, we prove that algorithm (3) using the cascading steplength scheme is indeed convergent to the 
optimal solution of problem (1). 

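Lemma 6: Let $q(\gamma) = 1 - 2\eta\gamma + \eta L\gamma^2$ and let $\eta < L$. Then, we have

$$0 < \frac{-\ln(q(\gamma))}{\gamma} \le \frac{2\eta L}{L - \eta} \quad \text{for } \gamma \in \left(0, \tfrac{2}{L}\right).$$

Furthermore,

$$\lim_{\gamma\to 0}\frac{-\ln(q(\gamma))}{\gamma} = 2\eta.$$

Proof: Let $r(\gamma) = \frac{-\ln(q(\gamma))}{\gamma}$. Note that the function $q(\gamma) = 1 - 2\eta\gamma + \eta L\gamma^2$ is nonnegative for all $\gamma$ since
$0 < \eta \le L$. Furthermore, $q(\gamma) < 1$ for $0 < \gamma < \frac{2}{L}$. Thus, $r(\gamma) > 0$ for $0 < \gamma < \frac{2}{L}$. We next show that $r(\gamma)$ is bounded
from above as stated. To show that the function is bounded, we employ the Taylor expansion of $\ln(q(\gamma))$. First,
we write

$$-\ln(q(\gamma)) = -\ln(1 - \beta(\gamma)) \quad \text{with } \beta(\gamma) = 2\eta\gamma - \eta L\gamma^2.$$

Noting that $\beta(\gamma) = 1 - q(\gamma) \in (0, 1)$, we then use the fact $\ln(1 - x) = -\sum_{k=1}^{\infty}\frac{x^k}{k}$ for $|x| < 1$, and obtain

$$-\ln(q(\gamma)) = \sum_{k=1}^{\infty}\frac{\beta(\gamma)^k}{k} \le \sum_{k=1}^{\infty}\beta(\gamma)^k = \frac{\beta(\gamma)}{1 - \beta(\gamma)} = \frac{\beta(\gamma)}{q(\gamma)}.$$

Using $\beta(\gamma) \le 2\eta\gamma$, we further obtain

$$\frac{-\ln(q(\gamma))}{\gamma} \le \frac{2\eta}{q(\gamma)}.$$

The function $q(\gamma)$ is convex over $\mathbb{R}$ and it attains its minimum at $\gamma^* = \frac{1}{L}$ with the minimum value $q^* = 1 - \frac{\eta}{L}$.
The minimum value satisfies $q^* > 0$ when $L > \eta$. Thus, when $\eta < L$, we have $q(\gamma) \ge 1 - \frac{\eta}{L}$, implying that

$$\frac{-\ln(q(\gamma))}{\gamma} \le \frac{2\eta L}{L - \eta}.$$

The relation for the limit is obtained by applying L'Hôpital's rule, as follows:

$$\lim_{\gamma\to 0}\frac{-\ln(1 - 2\eta\gamma + \eta L\gamma^2)}{\gamma} = \lim_{\gamma\to 0}\frac{2\eta - 2\eta L\gamma}{1 - 2\eta\gamma + \eta L\gamma^2} = 2\eta.$$

■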
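
As a quick sanity check of Lemma 6, the ratio $-\ln(q(\gamma))/\gamma$ can be evaluated on a grid of stepsizes in $(0, \frac{2}{L})$ and compared with the bound $\frac{2\eta L}{L - \eta}$ and the limit $2\eta$. This is only a numerical illustration; the values $\eta = 1$ and $L = 4$ are assumptions.

```python
import math

# Assumed illustrative constants with eta < L.
eta, L = 1.0, 4.0


def r(gamma):
    """r(gamma) = -ln(q(gamma)) / gamma with q(gamma) = 1 - beta(gamma)."""
    beta = 2.0 * eta * gamma - eta * L * gamma**2
    return -math.log1p(-beta) / gamma   # log1p keeps accuracy for small beta


bound = 2.0 * eta * L / (L - eta)
grid = [2.0 / L * i / 1000 for i in range(1, 1000)]   # points inside (0, 2/L)
values = [r(g) for g in grid]
worst = max(values)
near_zero = r(1e-8)

print(worst, bound, near_zero)
```

On this grid the ratio stays positive and below the bound, and its value for a very small $\gamma$ is close to $2\eta$.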

Proposition 7 (Cascading steplength stochastic approximation (CSA) scheme): Let Assumptions 1 and 3 hold.
Also, let $f$ be differentiable over the set $X$ with Lipschitz gradients with constant $L > 0$ and strongly convex with
constant $\eta > 0$, where $L > \eta$. Assume that the set $X$ is compact and let $D = \max_{x,y\in X}\|x - y\|$. Let the sequence
$\{x_k\}$ be generated by algorithm (3) and the cascading steplength scheme with $\gamma_0 \in (0, \frac{2}{L})$ and $\theta \in (0, 1)$. Then, $\{x_k\}$
converges almost surely to the unique optimal solution of problem (1).

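Proof: The result will follow from Proposition 1 provided we verify that Assumption 2 holds, i.e., $\sum_{k=0}^{\infty}\gamma_k = \infty$ and $\sum_{k=0}^{\infty}\gamma_k^2 < \infty$. According to Phase II$_t$ of the cascading scheme, we have $\gamma_k = \gamma_t$ for $k = \bar{K}_{t-1} + 1, \ldots, \bar{K}_t$,
with $\gamma_t = \theta^t\gamma_0$ and $\bar{K}_t = \bar{K}_{t-1} + K_t$. Therefore

$$\sum_{k=0}^{\infty}\gamma_k = \gamma_0\sum_{j=0}^{\infty}K_j\theta^j, \qquad \sum_{k=0}^{\infty}\gamma_k^2 = \gamma_0^2\sum_{j=0}^{\infty}K_j\theta^{2j}.$$

Thus, we need to show

$$\sum_{j=0}^{\infty}K_j\theta^j = \infty, \qquad \sum_{j=0}^{\infty}K_j\theta^{2j} < \infty.$$

From the definition of $K_t$ in (21) we have

$$q_t^{K_t}\,2^t\left(\prod_{j=0}^{t-1}q_j^{K_j}\right)D^2 > \frac{\gamma_t^2\nu^2}{1 - q_t}, \tag{25}$$

while $K_t + 1$ satisfies

$$q_t^{K_t+1}\,2^t\left(\prod_{j=0}^{t-1}q_j^{K_j}\right)D^2 \le \frac{\gamma_t^2\nu^2}{1 - q_t}. \tag{26}$$

Relation (26) and the fact $\gamma_t = \theta^t\gamma_0$ (see Phase II$_t$ of the cascading scheme) yield

$$\prod_{j=0}^{t}q_j^{\tilde{K}_j} \le \left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)},$$

where $\tilde{K}_j = K_j + 1$. Consequently, by taking logarithms and noting that $q_j \in (0, 1)$ for all $j$ (since $\gamma_0\theta^j \in (0, 2/L)$
by the choice of $\gamma_0$ and $\theta \in (0, 1)$), we have

$$\sum_{j=0}^{t}\tilde{K}_j(-\ln q_j) \ge -\ln\left(\left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)}\right).$$

Therefore, by multiplying and dividing by $\gamma_0\theta^j$, we obtain

$$\sum_{j=0}^{t}\tilde{K}_j\gamma_0\theta^j\,\frac{-\ln q_j}{\gamma_0\theta^j} \ge -\ln\left(\left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)}\right).$$

Note that $q_j = 1 - 2\eta\gamma_0\theta^j + \eta L(\gamma_0\theta^j)^2$ with $\gamma_0 \in (0, 2/L)$ and $\theta \in (0, 1)$. Thus, by Lemma 6 we have
$\frac{-\ln q_j}{\gamma_0\theta^j} \le 2\eta L/(L - \eta)$, implying

$$\frac{2\gamma_0\eta L}{L - \eta}\sum_{j=0}^{t}\tilde{K}_j\theta^j \ge -\ln\left(\left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)}\right).$$

Taking limits on both sides, we have that

$$\frac{2\gamma_0\eta L}{L - \eta}\sum_{j=0}^{\infty}\tilde{K}_j\theta^j \ge \lim_{t\to\infty} -\ln\left(\left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)}\right).$$

The limit on the right can be simplified by substituting $q_t = 1 - 2\eta\gamma_0\theta^t + \eta L\gamma_0^2\theta^{2t}$, leading to

$$\lim_{t\to\infty} -\ln\left(\left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)}\right) = \lim_{t\to\infty} -\ln\left(\left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(2\eta\gamma_0\theta^t - \eta L\gamma_0^2\theta^{2t})}\right) = +\infty,$$

where we also use $\theta \in (0, 1)$. Hence, $\sum_{j=0}^{\infty}\tilde{K}_j\theta^j = \infty$. Since $\tilde{K}_j = K_j + 1$ and $\theta \in (0, 1)$, it follows that

$$\infty = \sum_{j=0}^{\infty}\tilde{K}_j\theta^j = \sum_{j=0}^{\infty}K_j\theta^j + \sum_{j=0}^{\infty}\theta^j = \sum_{j=0}^{\infty}K_j\theta^j + \frac{1}{1 - \theta},$$

implying that $\sum_{j=0}^{\infty}K_j\theta^j = \infty$.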

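It remains to show that $\sum_{t=0}^{\infty}K_t\theta^{2t} < \infty$. From (25) and the fact $q_j \in (0, 1)$ for all $j$, we have that

$$q_t^{K_t} > \left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)}.$$

This allows for obtaining an upper bound on $K_t$, given by

$$K_t < \frac{\ln\left(\left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)}\right)}{\ln q_t}. \tag{27}$$

The desired result will follow by the Cauchy root test, if we show that

$$\limsup_{t\to\infty}\left(K_t\theta^{2t}\right)^{1/t} < 1.$$

By noting that $(K_t\theta^{2t})^{1/t} = \theta^2(K_t)^{1/t}$, it suffices to use the upper bound on $K_t$ in (27). We proceed to analyze
this bound, for which by letting $\beta(\gamma) = 1 - q(\gamma)$ and recalling that $q_t = q(\gamma_t)$ we have

$$\ln\left(\left(\frac{\theta^2}{2}\right)^t\frac{\gamma_0^2\nu^2}{D^2(1 - q_t)}\right) = t\ln\left(\frac{\theta^2}{2}\right) + \ln\left(\frac{\gamma_0^2\nu^2}{D^2}\right) - \ln(\beta(\gamma_t)).$$

Thus,

$$K_t^{1/t} \le \left(\frac{-t\ln\left(\frac{\theta^2}{2}\right) - \ln\left(\frac{\gamma_0^2\nu^2}{D^2}\right) + \ln(\beta(\gamma_t))}{-\ln(q_t)}\right)^{1/t}.$$

Noting that $\beta(\gamma) \in (0, 1)$ for all $\gamma$ when $\eta < L$, we have $\ln(\beta(\gamma)) < 0$, implying

$$K_t^{1/t} \le \left(\frac{-t\ln\left(\frac{\theta^2}{2}\right) - \ln\left(\frac{\gamma_0^2\nu^2}{D^2}\right)}{-\ln(q_t)}\right)^{1/t}. \tag{28}$$

Since $\beta(\gamma) \in (0, 1)$, the denominator can be expanded in Taylor series as follows:

$$-\ln(q_t) = -\ln(1 - \beta(\gamma_t)) = \sum_{k=1}^{\infty}\frac{\beta(\gamma_t)^k}{k} \ge \beta(\gamma_t).$$

Furthermore, since $\beta(\gamma_t) = \eta\gamma_t(2 - L\gamma_t)$ and $\gamma_t = \gamma_0\theta^t$ with $\theta \in (0, 1)$, we have $L\gamma_0\theta^t \le 1$ for $t$ large enough,
implying $\beta(\gamma_t) \ge \eta\gamma_0\theta^t$. Thus,

$$-\ln(q_t) \ge \eta\gamma_0\theta^t \quad \text{for } t \text{ large enough}. \tag{29}$$

By combining (28) and (29), we obtain for $t$ large enough,

$$K_t^{1/t} \le \left(\frac{-t\ln\left(\frac{\theta^2}{2}\right) - \ln\left(\frac{\gamma_0^2\nu^2}{D^2}\right)}{\eta\gamma_0\theta^t}\right)^{1/t} = \frac{t^{1/t}}{\theta(\eta\gamma_0)^{1/t}}\left(-\ln\left(\frac{\theta^2}{2}\right) - \frac{1}{t}\ln\left(\frac{\gamma_0^2\nu^2}{D^2}\right)\right)^{1/t}.$$

By recalling that $\lim_{t\to\infty}t^{1/t} = 1$ and $\lim_{t\to\infty}c^{1/t} = 1$ for any $c > 0$, it follows that

$$\limsup_{t\to\infty}K_t^{1/t} \le \frac{1}{\theta}\lim_{t\to\infty}\left(-\ln\left(\frac{\theta^2}{2}\right) - \frac{1}{t}\ln\left(\frac{\gamma_0^2\nu^2}{D^2}\right)\right)^{1/t}.$$

We next examine the limit on the right hand side. Letting $a = -\ln(\theta^2/2)$ and $b = -\ln\left(\frac{\gamma_0^2\nu^2}{D^2}\right)$, we can write

$$\lim_{t\to\infty}\left(a + \frac{b}{t}\right)^{1/t} = \lim_{t\to\infty}a^{1/t}\left(1 + \frac{b}{at}\right)^{1/t} = \lim_{t\to\infty}a^{1/t}\,\lim_{t\to\infty}\left(1 + \frac{b}{at}\right)^{1/t} = 1.$$

Therefore, $\limsup_{t\to\infty}K_t^{1/t} \le \frac{1}{\theta}$, implying that

$$\limsup_{t\to\infty}\left(K_t\theta^{2t}\right)^{1/t} \le \theta < 1.$$

As a consequence, the Cauchy root test is satisfied and $\sum_{t=0}^{\infty}K_t\theta^{2t} < \infty$. ■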

V. Addressing nondifferentiability through Local Randomized Smoothing 

In this section, we develop a smoothing approach for solving stochastic optimization problems with nonsmooth
integrands. In Section V-A, given a nondifferentiable function $f(x)$, we introduce a smooth approximation for
$f(x)$, denoted by $\hat{f}(x)$, by using local random perturbations. In Section V-B, we derive Lipschitz constants for the
gradients associated with this smooth approximation when the smoothing is introduced via a uniform distribution.
Finally, in Section V-C, the convergence theory of stochastic approximation schemes is examined in this modified
regime.

A. Differentiable Approximation

We let $f$ be nondifferentiable and consider its approximation $\hat{f}$, defined by

$$\hat{f}(x) \triangleq \mathsf{E}[f(x + z)], \tag{30}$$

where the expectation is with respect to $z \in \mathbb{R}^n$, a random vector with a compact support. Suppose that $z \in \mathbb{R}^n$ is
a random vector with a probability distribution over the $n$-dimensional ball centered at the origin and with radius
$\epsilon$. For the function $\hat{f}$ to be well defined, we need to enlarge the underlying set $X$ so that $f(x + z)$ is defined for
every $x \in X$. In particular, for a set $X \subseteq \mathbb{R}^n$ and $\epsilon > 0$, we let $X_\epsilon$ be the set defined by:

$$X_\epsilon = \{y \mid y = x + z,\ x \in X,\ z \in \mathbb{R}^n,\ \|z\| \le \epsilon\}.$$

We discuss our local smoothing technique under the assumption that the function $f$ has uniformly bounded
subgradients over the set $X_\epsilon$, given as follows.

Assumption 4: The subgradients of $f$ over $X_\epsilon$ are uniformly bounded, i.e., there is a scalar $C > 0$ such that
$\|g\| \le C$ for all $g \in \partial f(x)$ and $x \in X_\epsilon$.

Assumption 4 is satisfied, for example, when $X$ is bounded. In the sequel, we let $\mathsf{E}[g(x + z)]$ denote the vector-valued integral of an element from the set of subdifferentials, which is given by

$$\mathsf{E}[g(x + z)] = \left\{\bar{g} = \int g(x + z)\,p_u(z)\,dz \;\Big|\; g(x + z) \in \partial f(x + z) \text{ a.s.}\right\}. \tag{31}$$

The following lemma presents properties of the randomized technique (30) with an arbitrary local random distribution
over a ball. It states that, under the boundedness of the subgradients of $f$, the set $\mathsf{E}[g(x + z)]$ defined above is a
singleton. In particular, the lemma shows that $\hat{f}$ is a convex and differentiable approximation of $f$.

Lemma 7: Let $z \in \mathbb{R}^n$ be a random vector with the density distribution support contained in the $n$-dimensional
ball centered at the origin and with a radius $\epsilon$, and let $\mathsf{E}[z] = 0$. Let $X \subseteq \mathbb{R}^n$ be a convex set and let the function
$f(x)$ be defined and convex on the set $X_\epsilon$, where $\epsilon > 0$ is the parameter characterizing the distribution of $z$. Also,
let Assumption 4 hold. Then, for the function $\hat{f}$ given in (30), we have:

(a) $\hat{f}$ is convex and differentiable over $X$, with gradient

$$\nabla\hat{f}(x) = \mathsf{E}[g(x + z)] \quad \text{for all } x \in X,$$

where the vector $\mathsf{E}[g(x + z)]$ is as defined in (31). Furthermore, $\|\nabla\hat{f}(x)\| \le C$ for all $x \in X$.

(b) $f(x) \le \hat{f}(x) \le f(x) + \epsilon C$ for all $x \in X$.

Proof: (a) For the convexity and differentiability of $\hat{f}$ see the proof* of Lemma 3.3(a) in [44]. The gradient
boundedness follows by Assumption 4, relation (31), and $\nabla\hat{f}(x) = \mathsf{E}[\partial f(x + z)]$.

(b) By definition of random vector $z$, it has zero mean, i.e., $\mathsf{E}[x + z] = x$, so that $f(\mathsf{E}[x + z]) = f(x)$. Therefore,
by Jensen's inequality and the definition of $\hat{f}$, we have

$$f(x) = f(\mathsf{E}[x + z]) \le \mathsf{E}[f(x + z)] = \hat{f}(x) \quad \text{for all } x \in X.$$

To show relation $\hat{f}(x) \le f(x) + \epsilon C$, we use the subgradient inequality for $f$, which in particular implies that,
for every $\bar{x} \in X_\epsilon$ and $g \in \partial f(\bar{x})$, we have

$$f(\bar{x}) \le f(x) + \|g\|\,\|\bar{x} - x\| \quad \text{for all } x \in X_\epsilon.$$

Since $\bar{x} \in X_\epsilon$, we have $\bar{x} = x + z$ for some $x \in X$ and $z$ with $\|z\| \le \epsilon$. Using this and the subgradient boundedness,
from the preceding relation we obtain

$$f(x + z) \le f(x) + C\epsilon \quad \text{for all } x \in X.$$

Thus, by taking the expectation, we get $\hat{f}(x) = \mathsf{E}[f(x + z)] \le f(x) + \epsilon C$ for all $x \in X$. ■

*There, the vector $z$ has a normal zero-mean distribution. Furthermore, the proof is applicable to a convex function defined over $\mathbb{R}^n$. However,
the analysis can be extended in a straightforward way to the case when $f$ is defined over an open convex set $\mathcal{D} \subseteq \mathbb{R}^n$, since the directional
derivative $f'(x; d)$ is finite for each $x \in \mathcal{D}$ and for any direction $d \in \mathbb{R}^n$ (Theorem 23.1 in [71]).
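
A one-dimensional illustration of Lemma 7(b) may help. The sketch below is an assumption-laden example, not taken from the source: it uses $f(x) = |x|$ (so $C = 1$), a uniform perturbation on $[-\epsilon, \epsilon]$ (the one-dimensional ball), and a deterministic midpoint rule in place of sampling; the function names and the value $\epsilon = 0.1$ are hypothetical.

```python
import math


def f(x):
    """Nonsmooth test function with subgradients bounded by C = 1."""
    return abs(x)


def f_hat(x, eps, m=2001):
    """Midpoint-rule approximation of E[f(x + z)], z uniform on [-eps, eps]."""
    h = 2.0 * eps / m
    return math.fsum(f(x - eps + (i + 0.5) * h) for i in range(m)) / m


eps, C = 0.1, 1.0
xs = [-0.5 + 0.01 * i for i in range(101)]
rows = [(x, f(x), f_hat(x, eps)) for x in xs]

# The smoothed function sits between f and f + eps*C; the gap is largest at the kink.
largest_gap = max(fh - fx for _, fx, fh in rows)
print(largest_gap, f_hat(0.0, eps))
```

In this example the smoothed function coincides with $f$ away from the kink and lifts it by about $\epsilon/2$ at $x = 0$, which is within the $\epsilon C$ band of part (b).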



B. Smoothing via random variables with uniform distributions 

In this subsection, we consider a local smoothing technique wherein $z$ is generated via a uniform distribution.
Other distributions, such as the normal distribution considered in [44], may also work. However, distributions with
finite support seem more appropriate for capturing the local behavior of a function, as well as for dealing with
problems where the function itself has a restricted domain. We work with a uniform distribution because it lends
itself readily to computing the resulting Lipschitz constant and to assessing the growth of that constant with the
size of the problem.

The key result of this section is an examination of the Lipschitz continuity of the gradients of the smooth
approximation, particularly in terms of the rate at which the Lipschitz constant grows with problem size.

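Suppose $z \in \mathbb{R}^n$ is a random vector with uniform distribution over the n-dimensional ball centered at the origin
with radius $\epsilon$, i.e., $z$ has the following probability density function:

$$p_u(z) = \begin{cases} \dfrac{1}{c_n \epsilon^n} & \text{for } \|z\| \le \epsilon, \\ 0 & \text{otherwise,} \end{cases} \qquad (32)$$

where $c_n = \dfrac{\pi^{n/2}}{\Gamma\left(\frac{n}{2}+1\right)}$, and $\Gamma$ is the gamma function given by

$$\Gamma\left(\frac{n}{2}+1\right) = \begin{cases} \left(\dfrac{n}{2}\right)! & \text{if } n \text{ is even,} \\ \sqrt{\pi}\,\dfrac{n!!}{2^{(n+1)/2}} & \text{if } n \text{ is odd.} \end{cases}$$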
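As a quick sanity check on the constants in (32), the sketch below evaluates $\Gamma(n/2+1)$ through the even/odd closed forms and uses it to compute $c_n$ and the density $p_u$. The function names are illustrative and not taken from the paper.

```python
import math


def double_factorial(n):
    """n!! with the convention 0!! = (-1)!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def gamma_half(n):
    """Gamma(n/2 + 1) via the even/odd closed forms stated with (32)."""
    if n % 2 == 0:
        return float(math.factorial(n // 2))
    return math.sqrt(math.pi) * double_factorial(n) / 2 ** ((n + 1) // 2)


def unit_ball_volume(n):
    """c_n = pi^(n/2) / Gamma(n/2 + 1), the volume of the unit n-ball."""
    return math.pi ** (n / 2) / gamma_half(n)


def uniform_ball_density(z, eps):
    """p_u(z) from (32): constant inside the ball of radius eps, zero outside."""
    n = len(z)
    inside = math.sqrt(sum(v * v for v in z)) <= eps
    return 1.0 / (unit_ball_volume(n) * eps ** n) if inside else 0.0


if __name__ == "__main__":
    for n in range(1, 7):
        # The closed forms should agree with the library gamma function.
        print(n, gamma_half(n), math.gamma(n / 2 + 1), unit_ball_volume(n))
```

For $n = 2$ and $n = 3$ this reproduces the familiar values $c_2 = \pi$ and $c_3 = 4\pi/3$.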



The following lemma shows that $\hat f$ is a convex and differentiable approximation of $f$ with Lipschitz gradients, where
the Lipschitz constant for $\nabla \hat f$ is related to the norm bound for the subgradients of $f$.

Lemma 8: Let $z \in \mathbb{R}^n$ be a random vector with uniform density and zero mean over the n-dimensional
ball centered at the origin with radius $\epsilon$. Let $X \subseteq \mathbb{R}^n$ be a convex set and let the function $f(x)$ be defined
and convex on the set $X_\epsilon$, where $\epsilon > 0$ is the parameter characterizing the distribution of $z$. Also, let Assumption 4
hold. Then, for the function $\hat f$ given in (30), we have

$$\|\nabla \hat f(x) - \nabla \hat f(y)\| \le \kappa\,\frac{n!!}{(n-1)!!}\,\frac{C}{\epsilon}\,\|x - y\| \quad \text{for all } x, y \in X,$$

where $\kappa = \frac{2}{\pi}$ if $n$ is even, and otherwise $\kappa = 1$.

Proof: From Lemma 7(a) and relation (31), for any $x \in X$, there is a vector $g(z+x)$ such that $g(z+x) \in
\partial f(x+z)$ a.s. and

$$\nabla \hat f(x) = \int g(x+z)\,p_u(z)\,dz = \int g(v)\,p_u(v-x)\,dv,$$

where the last equality follows by letting $v = x + z$. Therefore, for any $x, y \in X$,

$$\|\nabla \hat f(x) - \nabla \hat f(y)\| = \left\| \int \big(p_u(z-x) - p_u(z-y)\big)\,g(z)\,dz \right\|
\le \int |p_u(z-x) - p_u(z-y)|\,\|g(z)\|\,dz
\le C \int |p_u(z-x) - p_u(z-y)|\,dz, \qquad (33)$$

where the last inequality follows by using the boundedness of the subgradients of $f$ over $X_\epsilon$.

Now, we let $x, y \in X$ be arbitrary but fixed, and we estimate $\int |p_u(z-x) - p_u(z-y)|\,dz$. For this we
consider the cases where $\|x-y\| > 2\epsilon$ and $\|x-y\| \le 2\epsilon$.

Case 1 ($\|x-y\| > 2\epsilon$): For every $z$ with $\|z-x\| \le \epsilon$, we have $\|z-y\| > \epsilon$, implying that $p_u(z-y) = 0$, so that
$\int_{\|z-x\|\le\epsilon} |p_u(z-x) - p_u(z-y)|\,dz = 1$. Likewise, for every $z$ with $\|z-y\| \le \epsilon$, we have $p_u(z-x) = 0$, implying

$$\int_{\|z-y\|\le\epsilon} |p_u(z-x) - p_u(z-y)|\,dz = 1.$$

Therefore,

$$\int |p_u(z-x) - p_u(z-y)|\,dz = \int_{\|z-x\|\le\epsilon} |p_u(z-x) - p_u(z-y)|\,dz + \int_{\|z-y\|\le\epsilon} |p_u(z-x) - p_u(z-y)|\,dz = 2.$$

Since $2 < \|x-y\|/\epsilon$, it follows that

$$\int |p_u(z-x) - p_u(z-y)|\,dz < \frac{\|x-y\|}{\epsilon}. \qquad (34)$$

It can be further seen that $\kappa\,\frac{n!!}{(n-1)!!} \ge 1$ for all $n \ge 1$, which combined with (34) and (33) yields the result.
Case 2 ($\|x-y\| \le 2\epsilon$): We decompose the integral in (33) over several regions, as follows:

$$\int |p_u(z-x) - p_u(z-y)|\,dz
= \int_{\|z-x\|\le\epsilon\ \&\ \|z-y\|\le\epsilon} |p_u(z-x) - p_u(z-y)|\,dz
+ \int_{\|z-x\|\le\epsilon\ \&\ \|z-y\|>\epsilon} |p_u(z-x) - p_u(z-y)|\,dz$$
$$\qquad + \int_{\|z-x\|>\epsilon\ \&\ \|z-y\|\le\epsilon} |p_u(z-x) - p_u(z-y)|\,dz
+ \int_{\|z-x\|>\epsilon\ \&\ \|z-y\|>\epsilon} |p_u(z-x) - p_u(z-y)|\,dz.$$

The first and the last integrals are zero, since $p_u(z-x) = p_u(z-y)$ for $z$ in the integration region there.
Furthermore, in the other two integrals, exactly one of $p_u(z-x)$ and $p_u(z-y)$ is nonzero, so that we have
$|p_u(z-x) - p_u(z-y)| = 1/(c_n\epsilon^n)$ for $z$ in the integration region there. Using this and the symmetry of these
integrals, by letting $S = \{z \in \mathbb{R}^n \mid \|z-x\| \le \epsilon \text{ and } \|z-y\| > \epsilon\}$, we obtain

$$\int |p_u(z-x) - p_u(z-y)|\,dz = \frac{2}{c_n\epsilon^n}\,V_S, \qquad (35)$$

where $V_S$ denotes the volume of the set $S$.

Now we want to find an upper bound for $V_S$ in terms of $\|x-y\|$. Let $V_{cap}(d)$ denote the volume of the spherical
cap with the distance $d$ from the center of the sphere. Therefore,

$$V_S = c_n\epsilon^n - 2V_{cap}\left(\frac{\|x-y\|}{2}\right). \qquad (36)$$

The volume of the n-dimensional spherical cap with distance $d$ from the center of the sphere can be calculated in
terms of the volumes of $(n-1)$-dimensional spheres, as follows:

$$V_{cap}(d) = \int_d^{\epsilon} c_{n-1}\left(\sqrt{\epsilon^2 - \rho^2}\right)^{n-1} d\rho \quad \text{for } d \in [0, \epsilon],$$

with $c_n = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}$ for $n \ge 1$. We have for $d \in [0, \epsilon]$,

$$V'_{cap}(d) = -c_{n-1}\,(\epsilon^2 - d^2)^{\frac{n-1}{2}} \le 0,$$
$$V''_{cap}(d) = (n-1)\,c_{n-1}\,d\,(\epsilon^2 - d^2)^{\frac{n-3}{2}} \ge 0,$$

where $V'_{cap}$ and $V''_{cap}$ denote the first and the second derivative, respectively, with respect to $d$. Hence, $V_{cap}(d)$ is
convex over $[0, \epsilon]$, and by the subgradient inequality we have

$$V_{cap}(0) + V'_{cap}(0)\,d \le V_{cap}(d) \quad \text{for } d \in [0, \epsilon].$$

Since $V_{cap}(0) = \frac{1}{2}c_n\epsilon^n$ and $V'_{cap}(0) = -c_{n-1}\epsilon^{n-1}$, it follows that

$$\frac{1}{2}c_n\epsilon^n - c_{n-1}\epsilon^{n-1}d \le V_{cap}(d) \quad \text{for } d \in [0, \epsilon]. \qquad (37)$$

Noting that $\|x-y\|/2 \le \epsilon$ (since $\|x-y\| \le 2\epsilon$), we can let $d = \|x-y\|/2$ in (37). By doing so and using (36),
we obtain

$$V_S = c_n\epsilon^n - 2V_{cap}\left(\frac{\|x-y\|}{2}\right) \le 2c_{n-1}\epsilon^{n-1}\,\frac{\|x-y\|}{2} = c_{n-1}\epsilon^{n-1}\|x-y\|.$$

Finally, substituting the preceding relation in (35), we have

$$\int |p_u(z-x) - p_u(z-y)|\,dz \le \frac{2c_{n-1}}{c_n}\,\frac{\|x-y\|}{\epsilon}.$$

Since $c_n = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}$, we have

$$\frac{2c_{n-1}}{c_n} = \kappa\,\frac{n!!}{(n-1)!!}, \qquad (38)$$

with $\kappa = \frac{2}{\pi}$ if $n$ is even, and otherwise $\kappa = 1$. Thus, we have

$$\int |p_u(z-x) - p_u(z-y)|\,dz \le \kappa\,\frac{n!!}{(n-1)!!}\,\frac{\|x-y\|}{\epsilon}. \qquad (39)$$

By combining (39) with (33), we obtain the desired result. ■
It can be seen that the Lipschitz constant $\kappa\,\frac{n!!}{(n-1)!!}\,\frac{C}{\epsilon}$ established in Lemma 8 for the differentiable
approximation $\hat f$ grows at the rate of $\sqrt{n}$ with the number $n$ of the variables, i.e.,

$$\lim_{n\to\infty} \frac{\kappa\,\frac{n!!}{(n-1)!!}}{\sqrt{n}} = \sqrt{\frac{2}{\pi}}.$$

This growth rate is worse than the growth rate $\sqrt{\ln(n+1)}$ obtained in [44] for the global smoothing approximation,
which uses a normally distributed perturbation vector $z$. However, it should be emphasized that the smoothing
technique in [44] requires the function $f$ to be defined over the entire space since $z$ is drawn from a normal
distribution, which is a somewhat stringent requirement. Our proposed local smoothing technique removes such a
requirement, but suffers from a worse growth rate.
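The $\sqrt{n}$ growth of the dimension-dependent factor can be observed directly. The sketch below is an illustration with a hypothetical function name (`lipschitz_factor`): it evaluates $\kappa\,n!!/(n-1)!!$ as a running product and compares it with $\sqrt{n}$. By Wallis-type asymptotics the ratio should approach $\sqrt{2/\pi} \approx 0.798$.

```python
import math


def lipschitz_factor(n):
    """Return kappa * n!! / (n-1)!!, the dimension-dependent factor in Lemma 8."""
    # n!!/(n-1)!! equals the product of k/(k-1) over k = n, n-2, ..., down to 2 or 3.
    # A running product of floats avoids forming the huge integers n!! and (n-1)!!.
    ratio = 1.0
    k = n
    while k > 1:
        ratio *= k / (k - 1)
        k -= 2
    kappa = 2.0 / math.pi if n % 2 == 0 else 1.0
    return kappa * ratio


if __name__ == "__main__":
    target = math.sqrt(2.0 / math.pi)
    for n in (1, 2, 10, 11, 100, 101, 1000, 1001):
        factor = lipschitz_factor(n)
        print(n, factor, factor / math.sqrt(n), target)
```

Multiplying the returned factor by $C/\epsilon$ gives the Lipschitz constant of Lemma 8 for a given dimension.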

C. Convergence analysis of the algorithm with local smoothing 

In this section, we apply the stochastic approximation scheme presented in Section II to the smooth approximation
$\hat f$ of a nondifferentiable function $f$. First, we consider the case when $f$ is convex but deterministic, and then we
consider the case when $f$ is given as the expectation of a convex function.

1) Deterministic nondifferentiable optimization: We apply the local smoothing technique to the minimization of
a convex but not necessarily differentiable function $f$. In particular, suppose we want to minimize such a function
$f$ over some set $X$. We may first approximate $f$ by a differentiable function $\hat f$ and then minimize $\hat f$ over $X$. In
this case, by taking the minimum over $x \in X$ in the relation in Lemma 7(b), we see that $f^* \le \hat f^* \le f^* + \epsilon C$.
Thus, we may overestimate the optimal value $f^*$ of the original problem by at most $\epsilon C$, where $C$ is a bound on
the subgradient norms of $f$. So we consider the following optimization problem

$$\min\,\{\hat f(x) \mid x \in X\}, \quad \text{where } \hat f(x) \triangleq E[f(x+z)]. \qquad (40)$$

We may solve the problem by considering the method (3), which takes the following form

$$x_{k+1} = \Pi_X\big[x_k - \gamma_k(\nabla \hat f(x_k) + w_k)\big] \quad \text{for } k \ge 0,$$
$$w_k = g_k - \nabla \hat f(x_k) \quad \text{with } g_k \in \partial f(x_k + z_k), \qquad (41)$$

where $\{z_k\}$ is an i.i.d. sequence of random vectors with uniform distribution over the n-dimensional ball
centered at the origin with radius $\epsilon > 0$.
We have the following result. 

Proposition 8: Let $f$ be defined and convex over some open convex set $\mathcal{D} \subseteq \mathbb{R}^n$. Let $X$ be a closed convex set
and let $\epsilon > 0$ be such that $X_\epsilon \subseteq \mathcal{D}$, where $\epsilon$ is the parameter of the distribution of the random vector $z$ as given
in (32). Let Assumptions 2(a) and 4 hold. Also, assume that problem (40) has a solution. Then, the sequence $\{x_k\}$
generated by method (41) converges almost surely to some random optimal solution of the problem.

Proof: We show that the conditions of Proposition 1 are satisfied. In particular, under the given assumptions,
the set $X_\epsilon$ is convex and closed (Corollary 9.1.2 in [71]). Furthermore, the function $F(x, z) = f(x+z)$ is convex
and finite on some open set containing the set $X$ for any $z \in \Omega = \{z \in \mathbb{R}^n \mid \|z\| \le \epsilon\}$. Since $z$ is a random vector
with uniform distribution on the ball, we see that $E[F(x, z)] = E[f(x+z)]$ is finite for every $x \in X$. Thus,
Assumption 1 is satisfied. Since $f$ has bounded subgradients on $X_\epsilon$ and $x_k + z_k \in X_\epsilon$ whenever $x_k \in X$, we have
$\|g_k\| \le C$. By Lemma 7(a), the gradients $\nabla \hat f(x)$ over $X$ are also bounded uniformly by $C$. Hence,

$$\|w_k\| \le \|g_k\| + \|\nabla \hat f(x_k)\| \le 2C,$$

implying that $E[\|w_k\|^2 \mid \mathcal{F}_k] \le 4C^2$. In view of this, and $\sum_{k=0}^{\infty} \gamma_k^2 < \infty$ (Assumption 2(a)), it follows that
$\sum_{k=0}^{\infty} \gamma_k^2\,E[\|w_k\|^2 \mid \mathcal{F}_k] < \infty$ almost surely, showing that Assumption 2(b) is satisfied. By Lemmas 7 and 8, the
function $\hat f$ is differentiable with Lipschitz gradients over $X$. Thus, the conditions of Proposition 1 are satisfied and
the result follows. ■
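Method (41) can be sketched in a few lines. The example below is a minimal illustration under stated assumptions rather than the authors' implementation: the objective $f(x) = \|x - c\|_1$, the box constraint $X = [-1, 1]^n$, and the harmonic steplength $\gamma_k = a/(k+1)$ are all hypothetical choices made for the demonstration. Note that $\nabla \hat f(x_k) + w_k = g_k$, so the update only needs a subgradient of $f$ at the perturbed point $x_k + z_k$.

```python
import math
import random


def sample_uniform_ball(n, eps, rng):
    """Draw one point uniformly from the n-dimensional ball of radius eps."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    r = eps * rng.random() ** (1.0 / n)
    return [r * v / norm for v in g]


def project_box(x, lo, hi):
    """Euclidean projection onto the box [lo, hi]^n (assumed feasible set X)."""
    return [min(max(v, lo), hi) for v in x]


def subgradient_l1(x, c):
    """One subgradient of the assumed objective f(x) = ||x - c||_1."""
    return [(v > ci) - (v < ci) for v, ci in zip(x, c)]


def smoothed_subgradient_method(c, eps, num_iters, a=1.0, seed=0):
    """Iterate x_{k+1} = Proj_X[x_k - gamma_k * g_k] with g_k in df(x_k + z_k)."""
    rng = random.Random(seed)
    n = len(c)
    x = [0.0] * n
    for k in range(num_iters):
        gamma = a / (k + 1)                       # assumed harmonic steplength
        z = sample_uniform_ball(n, eps, rng)      # local random perturbation z_k
        g = subgradient_l1([xi + zi for xi, zi in zip(x, z)], c)
        x = project_box([xi - gamma * gi for xi, gi in zip(x, g)], -1.0, 1.0)
    return x


if __name__ == "__main__":
    c = [0.3, -0.5, 0.1]
    x = smoothed_subgradient_method(c, eps=0.05, num_iters=5000)
    print(x, max(abs(xi - ci) for xi, ci in zip(x, c)))
```

In this toy instance the minimizer of both $f$ and $\hat f$ is $c$, so the printed distance indicates how closely the iterates settle around it.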

2) Stochastic nondifferentiable optimization: In this section, we apply our local smoothing technique to a
nondifferentiable stochastic problem of the form (1). Essentially, this amounts to putting the results of Sections II
and V-C together. We thus consider the following problem:

$$\text{minimize } \hat f(x) \quad \text{subject to } x \in X, \qquad (42)$$
$$\text{where } \hat f(x) = E[f(x+z)], \quad f(x) = E[F(x, \xi)],$$

$F$ is the function as described in Section II, and $\hat f$ is a smooth approximation of $f$ with $z$ having a uniform density
$p_u$ as discussed in Section V. In view of Lemma 7(b), we know that $\epsilon C$ is an upper bound for the difference
between the optimal values $f^* = \min_{x \in X} f(x)$ and $\hat f^* = \min_{x \in X} \hat f(x)$, under appropriate conditions to be stated
shortly. Under these conditions, we are interested in solving the approximate problem in (42).
Note that

$$\hat f(x) = E[f(x+z)] = E\big[E[F(x+z, \xi) \mid z]\big],$$

where the inner expectation is conditioned on $z$ and is with respect to $\xi$, while the outer expectation is with respect
to $z$. We note that the variables $\xi$ and $z$ are independent, and by exchanging the order of the expectations, we
obtain:

$$\hat f(x) = E\big[\hat F(x, \xi)\big], \quad \text{with } \hat F(x, \xi) = E[F(x+z, \xi)].$$

Thus, the problem in (42) is equivalent to

$$\text{minimize } \hat f(x), \quad \text{where } \hat f(x) = E[\hat F(x, \xi)], \quad \hat F(x, \xi) = E[F(x+z, \xi)],$$
$$\text{subject to } x \in X. \qquad (43)$$

In the following lemma, we provide some conditions ensuring the differentiability of $\hat F$ with respect to $x$, as well
as some other properties of $\hat F$. The lemma can be viewed as an immediate extension of Lemmas 7 and 8 to the
collection of functions $F(\cdot, \xi)$.

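Lemma 9: Let the set $X$ and function $F : \mathcal{D} \times \Omega \to \mathbb{R}$ satisfy Assumption 1. Let the parameter $\epsilon$ that characterizes
the distribution of $z$ be such that $X_\epsilon \subseteq \mathcal{D}$. In addition, assume that the subdifferential set $\partial_x F(x, \xi)$ is uniformly
bounded over the set $X_\epsilon \times \Omega$, i.e., there is a constant $C$ such that

$$\|s\| \le C \quad \text{for all } s \in \partial_x F(x, \xi), \text{ and all } x \in X_\epsilon \text{ and } \xi \in \Omega.$$

Then, for the function $\hat F : \mathcal{D} \times \Omega \to \mathbb{R}$ given by $\hat F(x, \xi) = E[F(x+z, \xi)]$, we have:

(a) For every $\xi \in \Omega$, the function $\hat F(\cdot, \xi)$ is convex and differentiable with respect to $x$ at every $x \in X$, and the
gradient $\nabla_x \hat F(x, \xi)$ is given by

$$\nabla_x \hat F(x, \xi) = E[\partial_x F(x+z, \xi)] \quad \text{for all } x \in X.$$

Furthermore, $\|\nabla_x \hat F(x, \xi)\| \le C$ for all $x \in X$ and $\xi \in \Omega$.

(b) $F(x, \xi) \le \hat F(x, \xi) \le F(x, \xi) + \epsilon C$ for all $x \in X$ and $\xi \in \Omega$.

(c) $\|\nabla_x \hat F(x, \xi) - \nabla_x \hat F(y, \xi)\| \le \kappa\,\frac{n!!}{(n-1)!!}\,\frac{C}{\epsilon}\,\|x-y\|$ for all $x, y \in X$ and $\xi \in \Omega$, where $\kappa = \frac{2}{\pi}$ if $n$ is even,
and otherwise $\kappa = 1$.

Proof: Under the given assumptions, each of the functions $F(\cdot, \xi)$ for $\xi \in \Omega$ satisfies the conditions of Lemmas 7
and 8. Thus, the results follow by applying these lemmas to each of the functions $F(\cdot, \xi)$ for $\xi \in \Omega$. ■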
In the light of Lemma 9, the optimal value $\hat f^*$ of the approximate problem in (43) is an overestimate of the
optimal value $f^*$ of the original problem (1) within the error $\epsilon C$. In particular, by taking the expectation with
respect to $\xi$ in the relation of Lemma 9(b) and then minimizing over $x \in X$, we obtain

$$f^* \le \hat f^* \le f^* + \epsilon C.$$

This motivates solving the approximate problem (43). Since for every $\xi \in \Omega$, the function $\hat F(\cdot, \xi)$ is convex and
differentiable over the set $X$, the function $\hat f(x) = E[\hat F(x, \xi)]$ is also convex and differentiable over the set $X$
(see [47]). Thus, the objective function $\hat f$ in (43) is differentiable. To solve the problem, we consider the method
in (3), which takes the following form:

$$x_{k+1} = \Pi_X\big[x_k - \gamma_k(\nabla \hat f(x_k) + w_k)\big] \quad \text{for } k \ge 0,$$
$$w_k = s_k - \nabla \hat f(x_k) \quad \text{with } s_k \in \partial_x F(x_k + z_k, \xi_k). \qquad (44)$$

We have the following convergence result for the method.

Proposition 9: Let the assumptions of Lemma 9 hold, and let Assumption 2 hold. Then, the sequence $\{x_k\}$
generated by method (44) converges almost surely to some optimal solution of problem (43).

Proof: It suffices to show that the conditions of Proposition 1 are satisfied for the set $X$, and the functions
$\hat F(x, \xi)$ and $\hat f(x)$. The result will then follow from Proposition 1.

We first verify that $\hat F(x, \xi)$ satisfies Assumption 1 and that $\hat f(x)$ has Lipschitz gradients over $X$. Under the given
assumptions, Lemma 9 holds. By Lemma 9(a)-(b), the function $\hat F(x, \xi)$ satisfies Assumption 1. Furthermore, by
Lemma 9(a) and (c), the function $\hat F(\cdot, \xi)$ is differentiable with Lipschitz gradients for every $\xi \in \Omega$. Hence,
$\hat f(x) = E[\hat F(x, \xi)]$ is also differentiable with the gradient given by $\nabla \hat f(x) = E[\nabla_x \hat F(x, \xi)]$ (see [47]). To see that
the gradients $\nabla \hat f$ are Lipschitz continuous, we take the expectation in the relation of Lemma 9(c), and we obtain
for all $x, y \in X$,

$$E\big[\|\nabla_x \hat F(x, \xi) - \nabla_x \hat F(y, \xi)\|\big] \le \kappa\,\frac{n!!}{(n-1)!!}\,\frac{C}{\epsilon}\,\|x-y\|,$$

where $\kappa = \frac{2}{\pi}$ if $n$ is even, and otherwise $\kappa = 1$. Using Jensen's inequality, we further have for all $x, y \in X$,

$$\big\|E[\nabla_x \hat F(x, \xi) - \nabla_x \hat F(y, \xi)]\big\| \le \kappa\,\frac{n!!}{(n-1)!!}\,\frac{C}{\epsilon}\,\|x-y\|.$$

Since $\nabla \hat f(x) = E[\nabla_x \hat F(x, \xi)]$, it follows that $\nabla \hat f(x)$ is Lipschitz over the set $X$. Thus, the objective function $\hat f$
satisfies the conditions of Proposition 1.

We now show that Assumption 2(b) is satisfied. In view of the assumption that $\sum_{k=0}^{\infty} \gamma_k^2 < \infty$ (Assumption 2(a)),
it suffices to show that $\|w_k\|$ is uniformly bounded. By the definition of $w_k$ in (44), we have for all $k$,

$$\|w_k\| \le \|s_k\| + \|\nabla \hat f(x_k)\| \quad \text{with } s_k \in \partial_x F(x_k + z_k, \xi_k),$$

where $x_k \in X$ and $\|z_k\| \le \epsilon$ for all $k$. Thus, $x_k + z_k \in X_\epsilon$ for all $k$. By the assumptions of Lemma 9, the
subdifferential set $\partial_x F(x, \xi)$ is uniformly bounded over $X_\epsilon \times \Omega$, implying that

$$\|w_k\| \le C + \|\nabla \hat f(x_k)\| \quad \text{for all } k \ge 0. \qquad (45)$$

We next prove that the gradients $\nabla \hat f(x)$ are uniformly bounded over the set $X$. Taking the expectation in the
relation $\|\nabla_x \hat F(x, \xi)\| \le C$ valid for any $x \in X$ and $\xi \in \Omega$ (Lemma 9(a)), and using Jensen's inequality, we obtain

$$\big\|E[\nabla_x \hat F(x, \xi)]\big\| \le E\big[\|\nabla_x \hat F(x, \xi)\|\big] \le C \quad \text{for } x \in X.$$

Since $\nabla \hat f(x) = E[\nabla_x \hat F(x, \xi)]$, we see that $\|\nabla \hat f(x)\| \le C$ for $x \in X$. This and relation (45) yield

$$\|w_k\| \le 2C \quad \text{for all } k \ge 0,$$

thus showing that $\|w_k\|$ is uniformly bounded. ■
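Method (44) samples in the product space of the original uncertainty $\xi$ and the artificial perturbation $z$. The sketch below illustrates one way to organize such an iteration. It is a hedged example, not the paper's code: the driver name `smoothed_sa`, the harmonic steplength, and the toy problem $F(x, \xi) = \|x - c - \xi\|_1$ with $\xi_i$ uniform on $[-0.1, 0.1]$ are all assumptions introduced for illustration.

```python
import math
import random


def sample_uniform_ball(n, eps, rng):
    """Draw one point uniformly from the n-dimensional ball of radius eps."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    r = eps * rng.random() ** (1.0 / n)
    return [r * v / norm for v in g]


def smoothed_sa(x0, subgrad_F, sample_xi, project, eps, num_iters, a=1.0, seed=0):
    """Iterate x_{k+1} = Proj_X[x_k - gamma_k * s_k], s_k in d_x F(x_k + z_k, xi_k)."""
    rng = random.Random(seed)
    x = list(x0)
    for k in range(num_iters):
        gamma = a / (k + 1)                          # assumed harmonic steplength
        z = sample_uniform_ball(len(x), eps, rng)    # artificial smoothing noise z_k
        xi = sample_xi(rng)                          # sample of the original uncertainty
        s = subgrad_F([u + v for u, v in zip(x, z)], xi)
        x = project([u - gamma * v for u, v in zip(x, s)])
    return x


# Hypothetical test problem: F(x, xi) = ||x - c - xi||_1 on the box [-1, 1]^2.
CENTER = [0.3, -0.5]


def subgrad_F(x, xi):
    """One subgradient of F(., xi) at x (sign vector of x - c - xi)."""
    return [(u > c + e) - (u < c + e) for u, c, e in zip(x, CENTER, xi)]


def sample_xi(rng):
    """Draw xi with independent components uniform on [-0.1, 0.1]."""
    return [rng.uniform(-0.1, 0.1) for _ in CENTER]


def project(x):
    """Euclidean projection onto the box [-1, 1]^n."""
    return [min(max(v, -1.0), 1.0) for v in x]


if __name__ == "__main__":
    print(smoothed_sa([0.0, 0.0], subgrad_F, sample_xi, project, eps=0.05, num_iters=5000))
```

Passing the subgradient oracle, the sampler, and the projection as callables keeps the driver independent of the particular problem, which mirrors how (44) is stated for a general $F$.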



VI. Numerical results 

In this section, we present computational results of applying our adaptive and smoothing schemes to three test
problems. Sections VI-A1, VI-A2 and VI-A3 consider a stochastic utility problem (see [10]), a bilinear matrix
game and a stochastic network utility maximization problem, respectively. In all of these examples, we compare the
performance of the recursive steplength SA scheme (RSA) and the cascading steplength SA scheme (CSA) with a
standard implementation of stochastic approximation. The standard SA scheme, in which the steplength sequence is
chosen to be a harmonic sequence, is referred to as the HSA scheme and is employed as a benchmark. For each
example, we provide this comparison for 9 problems of varying size and problem parameters, in addition to figures
illustrating the difference between theoretical bounds and the obtained results. Notably, the first two problems are
nonsmooth convex problems, prompting us to work with a regularized strongly convex form. In Section VI-B, we
discuss the sensitivity of the schemes to changes in parameters. Throughout Section VI, we use $N$, $n$, $\eta$ and $\epsilon$ to
denote the number of iterations, the problem dimension, the strong convexity parameter, and the size of the uniform
distribution employed for smoothing, respectively.



A. Examples 

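1) A stochastic utility problem: Consider the following optimization problem.

$$\min_{x \in X}\ f(x) = E\left[\phi\left(\sum_{i=1}^{n}\left(\frac{i}{n} + \xi_i\right)x_i\right)\right], \qquad (46)$$

where $X = \{x \in \mathbb{R}^n \mid x \ge 0,\ \sum_{i=1}^{n} x_i = 1\}$ and the $\xi_i$ are independent and normally distributed random variables
with mean zero and variance one. The function $\phi(\cdot)$ is a piecewise linear convex function given by
$\phi(t) = \max_{1 \le i \le m}\{v_i + s_i t\}$, where $v_i$ and $s_i$ are constants between zero and one, and
$F(x, \xi) = \phi\left(\sum_{i=1}^{n}\left(\frac{i}{n} + \xi_i\right)x_i\right)$. To apply our schemes, we require strong convexity of the function $f$.
Therefore, we regularize $f$ by adding the term $\frac{\eta}{2}\|x\|^2$ to $f$, where $\eta > 0$ is the strong convexity parameter.
We now apply the randomized smoothing technique discussed in Section V-C.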
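The smoothed regularized problem is given by

$$\min_{x \in X}\ \left\{\hat f(x) = E\left[\phi\left(\sum_{i=1}^{n}\left(\frac{i}{n} + \xi_i\right)(x_i + z_i)\right) + \frac{\eta}{2}\|x+z\|^2\right]\right\}, \qquad (47)$$

where $z \in \mathbb{R}^n$ has the uniform distribution on a ball with radius $\epsilon$ with independent elements $z_i$, $1 \le i \le n$. We let
$x^*$ denote an optimal solution of problem (46) and $x^*_{\epsilon,\eta}$ be the unique optimal solution of problem (47). To find
optimal solutions, we use an SAA method [18], which leads to a linear and a quadratic program for solving problem
(46) and problem (47), respectively.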

Table I shows the results of parametric analysis of the simulation of our schemes on problem (47). The table
is partitioned into three parts, each corresponding to a variation of parameters $n$, $N$, $\eta$, respectively. In each part,
one parameter has been assigned three increasing values while the other parameters are kept fixed, allowing us to
ascertain the impact of each parameter on the performance of the schemes. We generated 50 trajectories of the
RSA and CSA schemes for a given $n$, $N$, $\eta$, $\epsilon$. Over these realizations, we computed the means and 90% confidence
intervals. The baseline parameters are chosen as $n = 20$, $N = 4000$, $\epsilon = 0.5$, and $\eta = 0.5$ as a reference
for each group. Note that in Table I, the confidence intervals employ the logarithm of the error. Recall that we
have a theoretical upper bound on the error $E[\|x_k - x^*\|^2]$ as given by (8) and (17) for the RSA and CSA
schemes. Additionally, we obtain an empirical error bound based on using the scheme in practice.

Insights: We observe that the confidence intervals of both the CSA and the RSA schemes are relatively invariant
to changes in problem dimension. Furthermore, RSA appears to provide slightly tighter intervals in comparison
with CSA. Expectedly, increasing $N$ leads to significant improvement in these intervals, while larger values of $\eta$
lead to less accurate solutions (with respect to the unregularized problem) but tighter bounds. Moreover, the CSA
schemes in particular give better confidence bounds than RSA when $\eta$ is larger.
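The means and 90% confidence intervals over repeated trajectories can be computed as in the sketch below. The paper does not state which interval construction was used, so the normal-approximation interval for the mean here is an assumption, and `confidence_interval` is a hypothetical helper name.

```python
import math
import statistics


def confidence_interval(samples, level=0.90):
    """Normal-approximation confidence interval for the mean of the samples."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)                 # sample standard deviation
    # Two-sided interval: use the (0.5 + level/2) quantile of the standard normal.
    q = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = q * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    # Placeholder values standing in for terminal errors of independent trajectories.
    errors = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(confidence_interval(errors))
```

In an experiment of the kind described above, `samples` would hold one terminal error (or its logarithm) per trajectory, for example 50 values per parameter setting.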





     P(i)  n    N     ε       η       HSA - 90% CI         RSA - 90% CI         CSA - 90% CI
n    1     10   4000  5.0e-1  5.0e-1  [1.00e+0, 1.01e+0]   [1.58e-3, 1.96e-3]   [1.47e-3, 1.93e-3]   3.28e-2
     2     20   4000  5.0e-1  5.0e-1  [1.03e+0, 1.04e+0]   [1.74e-3, 2.21e-3]   [1.49e-3, 1.88e-3]   1.84e-2
     3     40   4000  5.0e-1  5.0e-1  [1.03e+0, 1.04e+0]   [2.21e-3, 2.54e-3]   [2.24e-3, 2.74e-3]   6.49e-2
N    4     20   1000  5.0e-1  5.0e-1  [1.05e+0, 1.05e+0]   [3.76e-3, 4.74e-3]   [4.67e-3, 5.96e-3]   1.84e-2
     5     20   2000  5.0e-1  5.0e-1  [1.04e+0, 1.05e+0]   [2.86e-3, 3.63e-3]   [2.78e-3, 3.57e-3]   1.84e-2
     6     20   4000  5.0e-1  5.0e-1  [1.03e+0, 1.04e+0]   [1.74e-3, 2.21e-3]   [1.49e-3, 1.88e-3]   1.84e-2
η    7     20   4000  5.0e-1  2.5e-2  [1.13e+0, 1.13e+0]   [2.77e-3, 3.48e-3]   [2.73e-3, 3.51e-3]   9.63e-3
     8     20   4000  5.0e-1  5.0e-1  [1.03e+0, 1.04e+0]   [1.74e-3, 2.21e-3]   [1.49e-3, 1.88e-3]   1.84e-2
     9     20   4000  5.0e-1  1.0e+0  [0.83e+0, 0.84e+0]   [9.70e-4, 1.21e-3]   [1.07e-3, 1.30e-3]   4.52e-2

TABLE I: Stochastic utility problem: HSA, RSA, CSA

2) A bilinear matrix game problem: We consider a bilinear matrix game,

$$\min_{x \in X}\max_{y \in Y}\ y^T A x, \qquad (48)$$

where $X = Y = \{x \in \mathbb{R}^n : \sum_{i=1}^{n} x_i = 1,\ x \ge 0\}$. Furthermore, $A$ is a symmetric matrix whose entries are
Aij = l<ij<n. (49) 

Problem (48) is a saddle point problem. Solving saddle point problems by SA algorithms has been discussed extensively
(cf. [72]). The gradient and its sampled variant to be employed in algorithm (3) are given by:

-2 )• ) • 

respectively, where $\imath(y, \xi_1)$ and $\imath(x, \xi_2)$ are random integers between 1 and $n$ with probabilities

$$\frac{y_i - \min(0, y_1, \ldots, y_n)}{\sum_{j=1}^{n}\big(y_j - \min(0, y_1, \ldots, y_n)\big)} \quad \text{and} \quad \frac{x_i - \min(0, x_1, \ldots, x_n)}{\sum_{j=1}^{n}\big(x_j - \min(0, x_1, \ldots, x_n)\big)}, \qquad i = 1, \ldots, n,$$

respectively, for arbitrary vectors $x$ and $y$. We generate these random variables through two independent random
variables $\xi_1$ and $\xi_2$ which are uniformly distributed in $[0, 1]$. Now, for any $(x, y) \in X \times Y$, we have
$\min(0, x_1, \ldots, x_n) = \min(0, y_1, \ldots, y_n) = 0$ and $\sum_{j=1}^{n} x_j = \sum_{j=1}^{n} y_j = 1$, so that these probabilities reduce to
$y_i$ and $x_i$, implying that $w_k$ has zero mean, i.e., $E[w_k \mid \mathcal{F}_k] = 0$ for all $k \ge 0$. To analyze the behavior of the
upper bound of the error arising from RSA and CSA, we need a strongly convex function. This is obtained by adding
a regularization term $\frac{\eta}{2}\|x\|^2 - \frac{\eta}{2}\|y\|^2$ to the function $y^T A x$, which makes it a strongly convex function with
respect to $x$ and a strongly concave function with respect to $y$. To apply the randomized technique in Section V,
we consider a $(2n)$-dimensional ball with radius $\epsilon$ and a uniform distribution over it. We use the following SA
algorithm to find an approximate solution of (48):

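$$x_{k+1} = \Pi_X\Big[x_k - \gamma_k\Big(G\big(x_k + \zeta_1^k,\ y_k + \zeta_2^k,\ \imath(y_k + \zeta_2^k, \xi_1^k)\big) + \eta\,(x_k + \zeta_1^k)\Big)\Big] \quad \text{for all } k \ge 0,$$
$$y_{k+1} = \Pi_Y\Big[y_k + \gamma_k\Big(G\big(x_k + \zeta_1^k,\ y_k + \zeta_2^k,\ \imath(x_k + \zeta_1^k, \xi_2^k)\big) - \eta\,(y_k + \zeta_2^k)\Big)\Big] \quad \text{for all } k \ge 0, \qquad (50)$$

where $\zeta_1^k$ and $\zeta_2^k$ are random vectors with uniform distribution in the $(2n)$-dimensional ball with
radius $\epsilon$.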
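The random indices described above, drawn with probabilities proportional to a shifted vector and generated from a single uniform variable on $[0, 1]$, can be produced by inverting the cumulative distribution. The following is a minimal sketch with hypothetical function names. It assumes the shifted entries do not all vanish, so that the normalizing sum is positive.

```python
import random


def index_probabilities(v):
    """p_i = (v_i - m) / sum_j (v_j - m) with m = min(0, v_1, ..., v_n)."""
    m = min(0.0, min(v))
    shifted = [vi - m for vi in v]
    total = sum(shifted)          # assumed positive (v is not identically equal to m)
    return [s / total for s in shifted]


def sample_index(v, xi):
    """Map xi in [0, 1] to an index in {1, ..., n} by inverting the CDF."""
    cumulative = 0.0
    for i, p in enumerate(index_probabilities(v), start=1):
        cumulative += p
        if xi < cumulative:
            return i
    return len(v)                 # guards against rounding when xi is at or near 1


if __name__ == "__main__":
    rng = random.Random(0)
    y = [0.2, 0.3, 0.5]           # a point of the simplex, so p_i = y_i
    draws = [sample_index(y, rng.random()) for _ in range(10000)]
    print([draws.count(i) / len(draws) for i in (1, 2, 3)])
```

For a vector in the simplex the shift is zero and the probabilities coincide with the entries themselves, which is the property used above to argue that the sampling error has zero mean.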

From the structure of $A$ in (49), it is observed that the optimal solution of problem (48) is obtained for $x^* =
[1, 0, \ldots, 0]^T$ and $y^* = [0, \ldots, 0, 1]^T$. This result can also be obtained quite simply by using a linear programming
reformulation. The regularized problem cannot be analyzed as easily, and its solution can be obtained by using QP
duality and SAA techniques.

Table II presents the results of simulations for the RSA and CSA schemes. As in Table I, Table II has three parts,
one for each varied parameter. For this problem, the distance between the optimal solutions of the original and the
approximate problems is very small, which shows that the optimal solution of the approximate problem is very close
to the optimal solution of problem (48). We set $n = 20$, $N = 4000$, $\epsilon = 0.2$, and $\eta = 0.01$ as the reference setting.
Figure 3b shows the theoretical upper bounds and the mean of samples of simulation for the RSA and CSA schemes.

Insights: Unlike in the stochastic utility problem, in this instance, the true optimal solution is obtained within
the $N$ gradient steps for most of the test problems. However, it should be remarked that the CSA appears to find
solutions faster than RSA in at least three of the problems (P(i): 3, 5 and 7).

     P(i)  n    N     ε       η       HSA - 90% CI         RSA - 90% CI           CSA - 90% CI
n    1     10   4000  2.0e-1  1.0e-2  [1.92e+0, 1.92e+0]   [S.00e-12, S.00e-12]   [2.00e-12, 2.00e-12]    0.00e-12
     2     20   4000  2.0e-1  1.0e-2  [1.92e+0, 1.92e+0]   [8.00e-12, 9.00e-12]   [5.50e-10, 5.76e-10]    0.00e-12
     3     40   4000  2.0e-1  1.0e-2  [1.92e+0, 1.92e+0]   [9.82e-2, 9.82e-2]     [3.55e-9, 3.70e-9]      0.00e-12
N    4     20   1000  2.0e-1  1.0e-2  [1.92e+0, 1.92e+0]   [2.79e-1, 2.79e-1]     [1.12e-1, 1.12e-1]      0.00e-12
     5     20   2000  2.0e-1  1.0e-2  [1.93e+0, 1.93e+0]   [1.07e-1, 1.07e-1]     [5.37e-10, 5.77e-10]    0.00e-12
     6     20   4000  2.0e-1  1.0e-2  [1.92e+0, 1.92e+0]   [8.00e-12, 9.00e-12]   [5.50e-10, 5.76e-10]    0.00e-12
η    7     20   4000  2.0e-1  5.0e-3  [1.96e+0, 1.96e+0]   [1.13e-1, 1.13e-1]     [-1.15e-10, 2.51e-10]   0.00e-12
     8     20   4000  2.0e-1  1.0e-2  [1.92e+0, 1.92e+0]   [8.00e-12, 9.00e-12]   [5.50e-10, 5.76e-10]    0.00e-12
     9     20   4000  2.0e-1  2.0e-2  [1.84e+0, 1.84e+0]   [1.07e-10, 1.46e-10]   [3.29e-9, 3.55e-9]      0.00e-12

TABLE II: Bilinear matrix game problem: HSA, RSA, CSA



3) A stochastic network utility problem: In this example, we consider a spatial network and the associated
network utility maximization problem (see [73], [74]). Suppose that there are $n$ users and $L$ links. The overall
network maximization problem is characterized by an objective that is a sum of user-specific concave utilities less
a congestion cost, which is given by a function of aggregate flow over a link. Let $x_i$ denote the $i$th user's flow rate
while $F_i(x_i, \xi_i)$ denotes its utility function, defined by

$$F_i(x_i, \xi_i) = -k_i(\xi_i)\log(1 + x_i),$$

where $k_i(\xi_i)$ is an uncertain parameter. Suppose that $A$ denotes the adjacency matrix that captures the set of links
traversed by the traffic. More precisely, for every link $l \in \mathcal{L}$ and user $i$, we have $A_{li} = 1$ if link $l$ carries flow of
user $i$ and $A_{li} = 0$ otherwise. The congestion cost is given by $c(x) = \|Ax\|^2$. The total cost at the network level
is then given by

$$F(x, \xi) = \sum_{i=1}^{n} F_i(x_i, \xi_i) + \|Ax\|^2.$$

Therefore

$$\nabla F(x, \xi) = -\left(\frac{k_1(\xi_1)}{1 + x_1}, \ldots, \frac{k_n(\xi_n)}{1 + x_n}\right)^T + 2A^T A x.$$

We assume that the user traffic rates are restricted by a capacity constraint $Ax \le C$. Since the objective function
$F$ is smooth, there is no requirement to introduce an additional smoothing.
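The sampled gradient of the network cost is simple to evaluate. The sketch below assumes the total cost has the form $F(x, \xi) = -\sum_i k_i(\xi_i)\log(1 + x_i) + \|Ax\|^2$, which is consistent with the gradient term $2A^TAx$ in the text. The small adjacency matrix and the sample values of $k$ are hypothetical.

```python
import math


def mat_vec(A, x):
    """Compute the product A x for a matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def sampled_cost(x, k, A):
    """F(x, xi) = -sum_i k_i log(1 + x_i) + ||A x||^2 for one sample k = k(xi)."""
    utility = sum(ki * math.log(1.0 + xi) for ki, xi in zip(k, x))
    return -utility + sum(v * v for v in mat_vec(A, x))


def sampled_gradient(x, k, A):
    """Gradient of the sampled cost: -k_i / (1 + x_i) + 2 (A^T A x)_i."""
    Ax = mat_vec(A, x)
    # (A^T A x)_i = sum over links l of A[l][i] * (A x)_l
    AtAx = [sum(A[l][i] * Ax[l] for l in range(len(A))) for i in range(len(x))]
    return [-ki / (1.0 + xi) + 2.0 * g for ki, xi, g in zip(k, x, AtAx)]


if __name__ == "__main__":
    A = [[1, 1, 0], [0, 1, 1]]    # hypothetical network: 2 links, 3 users
    x = [0.5, 0.2, 0.1]
    k = [0.3, 0.6, 0.9]           # one draw of the uncertain utility weights
    print(sampled_cost(x, k, A), sampled_gradient(x, k, A))
```

A finite-difference comparison against `sampled_cost` is a convenient way to confirm the gradient before plugging it into an SA iteration.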

Table III shows the results of simulations for the HSA, RSA, and CSA schemes. Here, we assume that $C_3 =
(0.10, 0.15, 0.20, 0.10, 0.15, 0.20, 0.20, 0.15, 0.25) = 0.75\,C_2 = 0.5\,C_1$ and $x$ is constrained to be nonnegative. We
also assume that $k_i(\xi_i)$ is drawn from the uniform distribution on $(0.2, 1)$ for every user. The confidence intervals for
the normed error between the terminating iterate and the optimal solution are reported for each problem.

Insights: We observe that both RSA and CSA schemes perform favorably in comparison with the HSA scheme. 
Importantly, neither scheme appears to deteriorate from a confidence interval standpoint when the problem size 
grows. Similar to the earlier examples, CSA appears to have slightly tighter confidence intervals in the empirical 
tests that we carried out. 



N     Capacity  HSA - 90% CI         RSA - 90% CI         CSA - 90% CI
4000            [1.58e-2, 1.89e-2]   [5.57e-3, 6.81e-3]   [3.65e-3, 4.55e-3]
4000            [1.16e-2, 1.38e-2]   [4.47e-3, 5.86e-3]   [3.62e-3, 4.52e-3]
4000  C3        [9.08e-3, 1.09e-2]   [4.30e-3, 5.32e-3]   [3.62e-3, 4.52e-3]

4000  C3        [9.08e-3, 1.09e-2]   [4.30e-3, 5.32e-3]   [3.62e-3, 4.52e-3]
4000  C3        [1.09e-2, 1.31e-2]   [4.80e-3, 5.94e-3]   [4.09e-3, 5.08e-3]
4000            [1.04e-2, 1.24e-2]   [5.21e-3, 6.36e-3]   [3.76e-3, 4.63e-3]

1000  C3        [8.98e-3, 1.07e-2]   [6.63e-3, 7.93e-3]   [5.36e-3, 6.43e-3]
2000  C3        [9.70e-3, 1.16e-2]   [5.65e-3, 6.88e-3]   [5.32e-3, 6.50e-3]
4000  C3        [9.08e-3, 1.09e-2]   [4.30e-3, 5.32e-3]   [3.62e-3, 4.52e-3]

TABLE III: Stochastic network utility problem: HSA, RSA, CSA

B. Interpretation of numerical results

[Figure 3, three panels: (a) Utility Problem, (b) Bimatrix Game, (c) Network Utility Problem. Each panel plots the
theoretical upper bound (UB) and the empirical mean of the error against the iteration number.]

Fig. 3: Theoretical and empirical error bounds for RSA and CSA schemes.

In this section, we interpret the numerical results obtained in the previous subsections, focusing on a comparison 
between the theoretical and empirical results and the sensitivity of the schemes to the algorithm parameters. 

1) Theoretical and empirical trajectories: In Figures 3a, 3b and 3c, we provide schematics of the trajectories
associated with the theoretically obtained upper bounds and the empirical means. Several observations can be
immediately made. In the context of the stochastic utility problem and the network utility maximization problem,
we observe that the RSA scheme displays uniformly better theoretical bounds in comparison with CSA. It is also
worth emphasizing that the "jumps" seen in the theoretical error bound trajectories of CSA correspond to junctures
where the steplengths drop. In fact, the cascading nature is also apparent in the empirical trajectories of the network
utility maximization problem in Fig. 3c, albeit in a less obvious fashion. We observe that the overall empirical behavior
of both schemes is similar in terms of the final errors for the utility and network utility maximization problems,
while in the context of the bimatrix game, the CSA scheme performs significantly better for a subset of problems.



[Figure 4, three panels: (a) HSA, (b) RSA, (c) CSA. Each panel plots the theoretical upper bounds (Th. UB-1, Th. UB-2,
Th. UB-3) and the empirical means (Mean-1, Mean-2, Mean-3) against the number of iterations (N).]

Fig. 4: The stochastic utility problem: HSA, RSA, CSA

2) Sensitivity to algorithm parameters: Finally, in this section, we discuss the sensitivity of each scheme to
algorithm parameters and provide a comparison with a standard stochastic approximation scheme where we assume
that the stepsize is $\gamma_k = \frac{a}{k}$ for $k \ge 1$ and $a > 0$. In HSA, we intend to examine the effect of choosing different
values of $a$ on the performance of the SA algorithm. In the RSA scheme, we have a choice of the first stepsize
$\gamma_0^{RSA}$ and the parameter $c$ in the inequality of Proposition 3. We set $c = 0.5$ and examine the impact of changing
$\gamma_0^{RSA}$. Finally, the CSA scheme performs differently with different choices of the cascading parameter $0 < \theta < 1$.
We consider three different values for each of $a$, $\gamma_0^{RSA}$, and $\theta$ and present simulations for HSA, RSA and CSA in
the case of the stochastic utility problem. The reference setting is specified by $n = 20$, $N = 4000$, $\epsilon = 0.5$, and
$\eta = 0.5$. Now suppose $a$, $\gamma_0^{RSA}$, and $\theta$ are set as follows:

$$a = 1,\ 0.5,\ \text{and } 0.25; \qquad \gamma_0^{RSA} = 1,\ 0.5,\ \text{and } 0.25; \qquad \theta = 0.75,\ 0.5,\ \text{and } 0.25.$$

Figure 4 shows the simulations for the specified parameters. Note that "Th. UB" shows the corresponding theoretical
upper bound of each scheme and "Mean" shows the mean of the error $\|z_k - z^*_{\epsilon,\eta}\|^2$ where $z = (x, y)$.

Figure 4a shows the harmonic scheme with $a = 1$, 0.5, and 0.25 corresponding to labels 1, 2, and 3 in the legend.
This shows that the performance of HSA is extremely sensitive to the choice of $a$, and HSA implementations with a
larger $a$ performed better for the stochastic utility problem. Furthermore, the error on termination of HSA schemes
can vary by nearly a factor of 10 for the problems that we tested. The update rules in the RSA schemes rely on $\eta$
and $L$, with $\gamma_0^{RSA}$ being the sole user input. Yet, when examining the sensitivity of the RSA scheme to the choice
of $\gamma_0^{RSA}$ (see Figure 4b with $\gamma_0^{RSA} = 1$, 0.5, and 0.25 corresponding to labels 1, 2, and 3), we observe that the
performance is relatively insensitive to the choice of initial stepsize. In effect, the modeler can be relatively less
concerned about such parameters when attempting to solve this class of problems. Importantly, both the theoretical
and the numerical behavior of RSA are almost the same for the three values of $\gamma_0^{RSA}$. Finally, a concern in the
implementation of CSA schemes is the choice of $\theta$, the cascading parameter, where $\theta \in (0, 1)$. Figure 4c shows the
simulation of the cascading scheme with $\theta = 0.75$, 0.50, and 0.25 corresponding to labels 1, 2, and 3. Theoretically,
we observe that smaller values of $\theta$ (more aggressive reductions in stepsize) lead to slightly superior theoretical
bounds, but not significantly so. However, the results are far more muted when conducting an empirical examination.
In particular, we observe that the CSA scheme appears to be relatively insensitive to diversity in the choice of $\theta$.
The relative robustness of the RSA and CSA schemes to the choice of parameters is seen as a crucial advantage
of such schemes.
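To make the parameters $a$ and $\theta$ concrete, the sketch below generates a harmonic steplength sequence $\gamma_k = a/k$ and a piecewise-constant cascading sequence. The cascading rule here is only an illustration: it assumes that the steplength is multiplied by $\theta$ at each reduction and takes the reduction iterations as a user-supplied list (`switch_points`, a hypothetical input), whereas the CSA scheme itself triggers reductions through an error threshold defined earlier in the paper.

```python
def harmonic_steplengths(a, num_iters):
    """gamma_k = a / k for k = 1, ..., num_iters (the HSA benchmark sequence)."""
    return [a / k for k in range(1, num_iters + 1)]


def cascading_steplengths(gamma0, theta, switch_points, num_iters):
    """Piecewise-constant steplengths, scaled by theta at each switch point.

    switch_points is a hypothetical list of iteration indices (1-based) at
    which the steplength is reduced; it stands in for the error-threshold
    test of the CSA scheme, which is not reproduced here.
    """
    switches = set(switch_points)
    gamma = gamma0
    sequence = []
    for k in range(1, num_iters + 1):
        if k in switches:
            gamma *= theta        # assumed multiplicative reduction
        sequence.append(gamma)
    return sequence


if __name__ == "__main__":
    print(harmonic_steplengths(1.0, 5))
    print(cascading_steplengths(1.0, 0.5, [3, 5], 6))
```

Smaller values of `theta` produce sharper drops at each switch point, which corresponds to the "more aggressive reductions in stepsize" discussed above.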

VII. Concluding remarks 

This paper is motivated by two shortcomings associated with standard stochastic approximation procedures for
stochastic convex programs. First, standard implementations of such schemes provide little guidance in specifying
parameters that may prove crucial in practical performance. Second, direct extensions of such schemes to nonsmooth
regimes are not immediate. Accordingly, this paper makes two sets of contributions. First, we develop two
adaptive steplength schemes and provide the associated global convergence theory. Of these, the former, a
recursive steplength scheme (RSA), specifies the steplength at a particular iteration using the previous steplength and
certain problem parameters. The second scheme, called a cascading steplength scheme (CSA), differs significantly
and is essentially a sequence of constant steplength schemes in which the steplength is reduced at specific points
in time. The second set of contributions extends these techniques to settings where the objective is not necessarily
differentiable. Through the use of a local smoothing method that perturbs the problem through a uniformly distributed
random variable, we propose a stochastic gradient scheme. Notably, Lipschitz bounds are obtained for the gradients
and their growth with problem size is found to be modest. Locally smoothed variants of the RSA and CSA schemes
were seen to perform well on two classes of nonsmooth stochastic optimization problems, and implementations were
seen to be relatively insensitive to problem parameters.